discussion: qualitative designs

Rose2015

Newman2016.pdf

8/11/20, 12(37 PMPrint

Page 1 of 94https://content.ashford.edu/print/Newman.2681.16.1?sections=navp…t&clientToken=b4349e22-4c0e-1d7b-dcca-46376762b169&np=navpoint-8

2.2 Reliability and Validity

Each of the three types of research designs (descriptive, correlational, and experimental) has the same basic goal: to take a hypothesis about some phenomenon and translate it into measurable and testable terms. That is, whether researchers use a descriptive, correlational, or experimental design to test predictions about income and happiness, they still need to translate (or operationalize) the concepts of income and happiness into measures that will be meaningful for the study. Unfortunately, research measurements will always be influenced by factors in addition to the conceptual variable of interest. Answers to any set of questions about happiness will depend both on actual levels of happiness and on the ways people interpret the questions. The meditation experiment may have different effects, depending on people’s experience with meditation. Even the percentage of Republicans voting for independent candidates will vary according to characteristics of a particular candidate.

These additional sources of influence can be grouped into two categories: random and systematic errors. Random error involves chance fluctuations in measurements, such as when a participant misunderstands the question, or shows up in a terrible mood after walking through a blizzard to get to the study. Although random errors can influence measurement, they generally cancel out over the span of an entire sample. That is, some people may overreact to a question while others underreact. Heavy snowfall might put one person in a terrible mood and make another appreciate the joy of winter. While both of these examples would add error to a dataset, they would cancel each other out in a sufficiently large sample.

Systematic errors, in contrast, are those that systematically increase or decrease along with values of the measured variable. For example, people who have more experience with meditation may consistently show more improvement in a meditation experiment than those with less experience. Or, people who have higher self-esteem may score higher on a measure of happiness than those with lower self-esteem. In this case, the happiness scale does not do a good job of homing in on the concept of “happiness” and will end up instead assessing a combination of happiness and self-esteem. These types of errors can cause more serious trouble for a researcher’s hypothesis tests because they interfere with the attempts to understand the link between two variables.

In sum, the measured values of a variable reflect a combination of the true score, random error, and systematic error, as the following conceptual equation shows:

Measured Score = True Score + (Random Error + Systematic Error)

For example:

Happiness Score = Actual Happiness + (Misreading the Question + Self-Esteem)
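As a rough illustration of this conceptual equation, the following Python sketch simulates the two error terms for a large sample. Every number in it (the sample size, the spread of the random error, and the 0.5 weight given to self-esteem) is an assumption made up for the example, not a value from the text. It suggests why random error tends to average out across participants while systematic error does not.

```python
import random
from statistics import mean

def simulate_errors(n, seed=0):
    """Return (random_errors, systematic_errors) for n simulated participants."""
    rng = random.Random(seed)
    random_errors = []
    systematic_errors = []
    for _ in range(n):
        self_esteem = rng.uniform(1, 7)               # hypothetical 1-7 self-esteem level
        random_errors.append(rng.gauss(0, 1))         # chance fluctuation centered on zero
        systematic_errors.append(0.5 * self_esteem)   # assumed contamination by self-esteem
    return random_errors, systematic_errors

random_errors, systematic_errors = simulate_errors(100_000)

# The random component averages out to roughly zero; the systematic one does not.
print(f"mean random error:     {mean(random_errors):.3f}")
print(f"mean systematic error: {mean(systematic_errors):.3f}")
```

Because the systematic component rises and falls with self-esteem, collecting a larger sample would not remove it from the happiness scores.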

So, if our measurements are also affected by outside influences, how do we know whether our measures are meaningful? Occasionally, the answer to this question is straightforward; if we ask people to report their weight or their income level, these values can be verified using objective sources. Many research questions within psychology, however, involve more ambiguity. How do we know that our happiness scale is accurate? The problem is that we have no way to objectively verify happiness beyond people’s self-reports of their own happiness. What researchers need, then, are ways to assess how close they are to measuring happiness in a meaningful way. This assessment involves two related concepts: reliability, or the consistency of a measure; and validity, or the accuracy of a measure. This section examines both of these concepts in detail.

Reliability

The consistency of time measurement by watches, cell phones, and clocks reflects a high degree of reliability. People think of a watch as reliable when it keeps track of the time consistently: an hour should take the same amount of time to pass, 24 times per day. Likewise, a scale is reliable when it gives the same value for weight in back-to-back measurements: an individual’s weight should be the same if he steps off the scale and right back on, provided he stays away from the fridge in the meantime.

Reliability is defined as the extent to which a measured variable is free from random errors, and it is best understood as the degree of consistency in research measurements. As the chapter discussed previously, researchers’ measures are never perfect, and five main sources of random error threaten reliability:

1. Transient states, or temporary fluctuations in participants’ cognitive or mental state; for example, some participants may complete a study after an exhausting midterm or after a fight with their significant others.

2. Stable individual differences among participants; for example, some participants are habitually more motivated or happier than other participants.

3. Situational factors in the administration of the study; for example, an experiment conducted in the early morning may make everyone tired or grumpy.

4. Bad measures that add ambiguity or confusion to the measurement; for example, participants may respond differently to a question about “the kinds of drugs you are taking.” Some may take this to mean illegal drugs, whereas others interpret it as prescription or over-the-counter drugs.

5. Mistakes in coding responses during data entry; for example, a handwritten “7” could be mistaken for a “4.” (Happily, these types of errors have been minimized by the increasing role of computers in data collection. If someone clicks the number “7” in an online survey, the computer will record it as a “7” almost every time.)

Researchers naturally want to minimize the influence of all of these sources of error, and the text will touch on techniques for doing so throughout. However, researchers are also resigned to the fact that all measurements contain a degree of error. The goal, then, is to develop an estimate of how reliable measures are. Researchers generally estimate reliability in three ways.

1. Test–retest reliability refers to the consistency of the measure over time, much like the examples of a reliable watch and a reliable scale. A fair number of research questions in the social and behavioral sciences involve measuring stable qualities. For example, if someone were to design a measure of intelligence or personality, both of these characteristics should be relatively stable over time. An individual’s score on an intelligence test today should be roughly the same as the score when tested again in five years. A person’s level of extroversion today should correlate highly with his or her level of extroversion in 20 years. The test–retest reliability of these measures is quantified by simply correlating measures at two time points. The higher these correlations are, the higher the reliability will be. This makes conceptual sense as well; if measured scores reflect the true score more than they reflect random error, then this will result in increased stability of the measurements.

2. Inter-item reliability refers to the internal consistency among different items on a measure. Think back to the last time you completed a survey. Did it seem to ask the same questions more than once? (Chapter 4 [4.1] will discuss this technique.) The repetition is included because a single item is more likely to contain measurement error than the average of several items is; remember that small random errors tend to cancel each other out. Consider the following items from Sheldon Cohen’s Perceived Stress Scale (Cohen, Kamarck, & Mermelstein, 1983):

In the last month, how often have you felt that you were unable to control the important things in your life?

In the last month, how often have you felt confident about your ability to handle your personal problems?

In the last month, how often have you felt that things were going your way?

In the last month, how often have you felt difficulties were piling up so high that you could not overcome them?

Each of these items taps into the concept of feeling “stressed out,” or overwhelmed by the demands of life. One standard way to evaluate a measure like this is to compute Cronbach’s alpha, a statistic based on the number of items and the average correlation between each pair of items. The more these items tap into a central, consistent construct, the higher the value of this statistic is. Conceptually, a higher alpha means that variation in responses to the different items reflects variation in the “true” variable being assessed by the scale items. Alpha levels typically range from zero to one, with higher numbers indicating more internal consistency. As a general rule, researchers want this index to be above 0.70 to have any confidence in the measure.

3. Interrater reliability refers to the consistency among judges observing participants’ behavior. The previous two forms of reliability were relevant in dealing with self-report scales; interrater reliability is more applicable when the research involves behavioral measures, which involve direct and systematic recording of observable behaviors. Imagine a researcher is studying whether alcohol consumption makes people behave more aggressively. One way to tackle this hypothesis would be to have a group of judges observe participants after drinking and rate their levels of aggression. In the same way that using multiple scale items helps to cancel out the small errors of individual items, using multiple judges cancels out the variations in each individual’s ratings. In this case, people could have slightly different ideas and thresholds for what constitutes aggression. To determine how much these differences matter, the researcher can evaluate the judges’ ratings by calculating the average correlation among the ratings. The higher the alpha values, the more the judges agree in their ratings of aggressive behavior. Conceptually, a higher alpha value means that variation in the judges’ ratings reflects real variation in levels of aggression.
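The three reliability estimates described above can be sketched in a few lines of Python. This is a minimal illustration rather than a definitive implementation: the function names and all of the scores are hypothetical. The alpha function uses the standard variance-based formula, which assumes every item is scored in the same direction. On a scale like the Perceived Stress Scale, positively worded items would need to be reverse-scored first.

```python
from statistics import mean, pvariance

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def cronbach_alpha(items):
    """Cronbach's alpha; `items` holds one list of participant scores per item."""
    k = len(items)
    totals = [sum(person) for person in zip(*items)]   # each participant's total score
    item_variance_sum = sum(pvariance(item) for item in items)
    return (k / (k - 1)) * (1 - item_variance_sum / pvariance(totals))

def mean_pairwise_r(raters):
    """Average correlation across every pair of judges."""
    pairs = [(i, j) for i in range(len(raters)) for j in range(i + 1, len(raters))]
    return mean([pearson_r(raters[i], raters[j]) for i, j in pairs])

# Hypothetical data, invented purely for illustration.
time_1 = [10, 12, 15, 18, 20]            # scores at the first testing session
time_2 = [11, 12, 16, 17, 21]            # same people, tested again later
stress_items = [[1, 2, 3, 4, 5],         # item 1 responses from five people
                [2, 2, 3, 5, 5],         # item 2
                [1, 3, 3, 4, 4]]         # item 3
judges = [[1, 3, 4, 6, 7],               # judge A's aggression ratings
          [2, 3, 5, 6, 6],               # judge B
          [1, 2, 4, 5, 7]]               # judge C

print("test-retest r:   ", round(pearson_r(time_1, time_2), 2))
print("Cronbach's alpha:", round(cronbach_alpha(stress_items), 2))
print("mean inter-judge r:", round(mean_pairwise_r(judges), 2))
```

In each case a value closer to 1 would indicate that less of the variation in scores is attributable to random error.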

Validity

Recall the watch and scale examples. Perhaps some people set their watch 10 minutes ahead to avoid being late. Or perhaps certain individuals adjust their scale by 5 pounds to boost either their motivation or self-esteem. In these cases, the watch and the scale may produce consistent measurements, but the measurements are not accurate. It turns out that the reliability of a measure is a necessary but not sufficient basis for evaluating it. Put bluntly, measures can be (and have to be) consistent, but they might still be worthless. The additional piece of the puzzle is the validity of measures, or the extent to which they accurately measure what they are designed to measure.

Whereas reliability is threatened more by random error, validity is threatened more by systematic error. If the measured scores on the happiness scale reflect, say, self-esteem more than they reflect happiness, this would threaten the validity of the scale. The previous section explained that a test designed to measure intelligence ought to be consistent over time. And, in fact, these tests do show very high degrees of reliability. However, several researchers have cast serious doubts on the validity of intelligence testing, arguing that even scores on an official IQ test are influenced by a person’s cultural background, socioeconomic status (SES), and experience with the process of test-taking (for discussion of these critiques, see Daniels et al., 1997; Gould, 1996). For example, children growing up in higher SES households tend to have more books in the home, spend more time interacting with one or both parents, and attend schools that have more time and resources available—all of which correlate with scores on IQ tests. Thus, because all of these factors could increase scores on an intelligence test, they amount to systematic error in the measure of intelligence and, therefore, threaten the validity of a measured score on an intelligence test.

Researchers have two primary ways to discuss and evaluate the validity, or accuracy, of measures: construct validity and criterion validity.

Researchers evaluate construct validity based on how well the measures capture the underlying conceptual ideas (i.e., the constructs) in a study. These constructs are equivalent to the “true score” discussed in the previous section. That is, how accurately does the bathroom scale measure the construct of weight? How accurately does an IQ test measure the construct of intelligence relative to other things? The validity of measures can be assessed in a couple of ways. On the subjective end of the continuum, researchers can evaluate construct validity by assessing the face validity of the measure, or the extent to which it simply seems like a good measure of the construct. The items from the Perceived Stress Scale have high face validity because the items match what we intuitively mean by “stress” (e.g., “How often have you felt difficulties were piling up so high that you could not overcome them?”). However, if we were to measure an individual’s speed at eating hot dogs and then state it was a stress measure, the participant might be skeptical because hot-dog eating speed would lack face validity as a measure of stress.

Although face validity is nice to have, it can sometimes (ironically) reduce the validity of the measures. Imagine seeing the following two measures on a survey of attitudes:

1. Do you dislike people whose skin color is different from yours?

2. Do you ever beat your children?

On the one hand, these are extremely face-valid measures of attitudes about prejudice and corporal punishment—the questions very much capture our intuitive ideas about these concepts. On the other hand, even people who do support these attitudes may be unlikely to answer honestly because they recognize that neither attitude is popular. In cases like this, a measure low in face validity might end up being the more accurate approach. Chapter 4 will discuss ways to strike this balance.

On the less subjective end, researchers can evaluate construct validity by examining measures’ empirical connections to both related and unrelated constructs. Imagine for a moment that we are developing a new measure of liberal political attitudes. If we think about a person who describes herself as liberal, she is likely to support gun control, equal rights, and a woman’s right to choose. And she is less likely to be pro-war, anti-immigration, or anti-gay rights. Therefore, we would expect our new liberalism measure to correlate positively with existing measures of support for gun control, affirmative action, and abortion rights. This pattern of correlations taps into the metric of convergent validity, or the extent to which our measure overlaps with conceptually similar measures. But we would also want to ensure that the new measure captures something distinct from other constructs. In this case, we might want to demonstrate that we have developed a true measure of political attitudes, which does not simply correlate with religious beliefs. That is, we would want to show that liberal political views could be independent of religion. This hypothesized lack of correlation taps into the metric of discriminant validity, or the extent to which a measure diverges from unrelated measures.

To take another example, imagine someone wanted to develop a new measure of narcissism, usually defined as an intense desire to be liked and admired by other people. Narcissists tend to be self-absorbed but also very attuned to the feedback they receive from other people—especially feedback about the extent to which people admire them. Narcissism somewhat resembles self-esteem but differs enough; perhaps it is best viewed as high and unstable self-esteem. So, given these facts, a researcher might assess the discriminant validity of the measure by making sure it does not overlap too closely with measures of self-esteem or self-confidence. This approach would establish that the narcissism measure stands apart from these different constructs. The researcher might then assess the convergent validity of the measure by making sure that it does correlate with things like sensitivity to rejection and need for approval. These correlations would place the measure into a broader theoretical context and help to establish it as a valid measure of the construct of narcissism.
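To make the convergent and discriminant logic concrete, here is a small Python sketch using the narcissism example. The scores for the five respondents are invented for illustration only, and the variable names are hypothetical. The idea is simply that a new measure should correlate strongly with a related construct and weakly with a construct it is meant to be distinct from.

```python
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical scores for five respondents (made up for this example).
narcissism = [1, 2, 3, 4, 5]          # the new measure being validated
need_for_approval = [2, 3, 3, 5, 6]   # conceptually related construct
self_esteem = [4, 3, 2, 3, 4]         # construct the measure should stand apart from

convergent_r = pearson_r(narcissism, need_for_approval)
discriminant_r = pearson_r(narcissism, self_esteem)

print(f"convergent correlation:   {convergent_r:.2f}")
print(f"discriminant correlation: {discriminant_r:.2f}")
```

Real validation work would rely on much larger samples and several comparison measures; the sketch only shows the pattern of correlations a researcher would be looking for.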

Criterion validity involves evaluating the validity of measures based on the association between measures and relevant behavioral outcomes. The “criterion” in this case refers to a measure that can be used to make decisions. For example, if someone developed a personality test to assess an individual’s management style, the most relevant metric of its validity is whether it predicts a person’s actual behavior as a manager. That is, we might expect people scoring high on this scale to be able to increase the productivity of their employees and to maintain a comfortable work environment. Likewise, if someone developed a measure that predicted the best careers for graduating seniors based on their skills and personalities, then criterion validity would be assessed using people’s actual success in these various careers. Whereas construct validity is more concerned with the underlying theory behind the constructs, criterion validity is more concerned with the practical application of measures. As might be expected, researchers are more likely to use this approach in applied settings.

Photo: Criterion validity can be used to predict a future behavioral outcome like management success.

That said, criterion validity is also a useful way to supplement validation of a new questionnaire. For example, a questionnaire about generosity should be able to predict people’s annual giving to charities, and a questionnaire about hostility ought to predict hostile behaviors. To supplement the construct validity of the narcissism measure, a researcher might examine its ability to predict the ways people respond to rejection and approval. Based on the definition of the construct, the researcher might hypothesize that narcissists would become hostile following rejection and perhaps become eager to please following approval. If these predictions were supported, it would mean further validation that the measure was capturing the construct of narcissism.

Criterion validity falls into one of two categories, depending on whether the researcher is interested in the present or the future. Predictive validity involves attempting to predict a future behavioral outcome based on the measure, as in the examples of the management-style and career-placement measures. Predictive validity is also at work when researchers (and colleges) try to predict graduates’ likelihood of school success based on SAT or GRE scores. The goal here is to validate the construct via its ability to predict the future.

In contrast, concurrent validity involves attempting to link a self-report measure with a behavioral measure collected at the same time, as in the examples of the generosity and hostility questionnaires. The phrase “at the same time” is used loosely here; these self-report and behavioral measures may be separated by a short time span. In fact, concurrent validity sometimes involves trying to predict behaviors that occurred before completion of the scale, such as trying to predict students’ past drinking behaviors from an “attitudes toward alcohol” scale. The goal in this case is to validate the construct via its association with similar measures.

Comparing Reliability and Validity

This section has discussed how both reliability (consistency) and validity (accuracy) are ways to evaluate measured variables and to assess how well these measurements capture the underlying conceptual variable. In establishing estimates of both of these metrics, researchers essentially examine a set of correlations with their measured variables. But while reliability involves correlating variables with themselves (e.g., happiness scores at week 1 and week 4), validity involves correlating variables with other variables (e.g., happiness scale with the number of times a person smiles). Figure 2.3 displays the relationships among types of reliability and validity.

Figure 2.3: Types of reliability and validity

We learned earlier that reliability is necessary but not sufficient to evaluate measured variables. That is, reliability has to come first and is an essential requirement for any variable—no one would trust a watch that was sometimes five minutes fast and other times ten minutes slow. If we cannot establish that a measure is reliable, then there is really no chance of establishing its construct validity because every measurement might be a reflection of random error. However, just because a measure is consistent does not make it accurate. Someone’s watch might consistently be ten minutes fast; a scale might always be five pounds under the person’s actual weight. For that matter, a test of intelligence might result in consistent scores but actually be capturing respondents’ cultural background. Reliability tells us the extent to which a measure is free from random error. Validity takes the second step of telling us the extent to which the measure is also free from systematic error.
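The watch example lends itself to a short numerical sketch. In the Python below, the readings are hypothetical: each number is how many minutes a watch differs from the true time on a given check. The mean of those differences stands in for systematic error (a validity problem), and their standard deviation stands in for random error (a reliability problem).

```python
from statistics import mean, pstdev

def describe_errors(errors_in_minutes):
    """Return (systematic_error, random_error) as (mean, standard deviation)."""
    return mean(errors_in_minutes), pstdev(errors_in_minutes)

# Hypothetical readings: minutes by which each watch differs from the true time.
consistent_but_fast = [10, 10, 10, 10, 10, 10]   # always ten minutes fast
erratic = [5, -10, 5, -10, 5, -10]               # sometimes fast, sometimes slow

for name, errors in [("consistent but fast", consistent_but_fast),
                     ("erratic", erratic)]:
    bias, spread = describe_errors(errors)
    print(f"{name}: systematic error = {bias} min, random error = {spread} min")
```

The first watch shows no spread at all, so it is reliable, yet it is off by a constant amount and is therefore not valid. The second watch varies so much from reading to reading that no single reading can be trusted.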

Finally, it is worth pointing out that establishing validity for a new measure is hard work. Reliability can be tested in a single step by correlating scores from multiple measures, multiple items, or multiple judges within a study. But testing the construct validity of a new measure involves demonstrating both convergent and discriminant validity. In developing our narcissism scale, we would need to show that it correlated with things like fear of rejection (convergent) but was reasonably different from things like self-esteem (discriminant). The latter criterion is particularly difficult to establish because it takes time and effort—and multiple studies—to demonstrate that one scale is distinct from another. However, an easy way exists to avoid these challenges: Use existing measures whenever possible. Before creating a brand-new happiness scale, or narcissism scale, or self-esteem scale, check the research literature to see if one exists that has already gone through the ordeal of being validated.

2.3 Scales and Types of Measurement

One of the easiest ways to decrease error variance and thereby increase reliability and validity is to make smart choices when designing and selecting measures. Throughout this book, we will discuss guidelines for each type of research design and ways to ensure that measures are as accurate and unbiased as possible. This section examines some basic rules that apply across all three types of design. We first review the four scales of measurement and discuss the proper use of each one; we then turn our attention to three types of measurement used in psychological research studies.

Scales of Measurement

Whenever researchers perform the process of translating conceptual variables into measurable variables (i.e., operationalization; see Chapter 1, section 1.2), they must ensure that their measurements accurately represent the underlying concepts. In Chapter 1, the discussion of validity explained that this accuracy is a critical piece of hypothesis testing. For example, if researchers develop a scale to measure job satisfaction, then they need to verify that this is actually what the scale is measuring.

However, measurement accuracy has an additional, subtler dimension: We also need to be sure that the numbers used in our chosen measurement accurately reflect the underlying mathematical properties of the concept. In many cases in the natural sciences, this process is automatically precise. When we measure the speed of a falling object or the temperature of a boiling object, the underlying concepts (speed and temperature) translate directly into scaled measurements. In the social and behavioral sciences, though, this process is trickier; researchers have to decide carefully how best to represent abstract concepts such as happiness, aggression, and political attitudes. As researchers take the step of scaling variables, or specifying the relationship between a conceptual variable and numbers on a quantitative measure, they have four different scales to choose from, presented below in order of increasing statistical power and flexibility.

Nominal Scales

Nominal scales are used to label or identify a particular group or characteristic. For example, we can label a person’s gender as male or female, and we can label a person’s religion as Catholic, Protestant, Buddhist, Jewish, Muslim, Hindu, etc. In experimental designs, researchers can also use nominal scales to label the condition to which a person has been assigned (e.g., experimental or control groups). The assumption in using these labels is that members of the group have some common value or characteristic, as defined by the label. For example, everyone in the Catholic group should have similar religious beliefs, and everyone in the female group should be of the same gender.

Research studies commonly represent these labels using numeric codes in a data file, such as 1 to indicate females and 2 to indicate males. However, these numbers are completely arbitrary and meaningless—that is, males do not have more gender than females. We could just as easily replace the 1 and the 2 with another pair of numbers or with a pair of letters or names. Thus, the primary limitation of nominal scales is that the scaling itself is arbitrary, which prevents us from using these values in mathematical calculations. One helpful way to appreciate the difference between this scale and the next three is to think of nominal scales as qualitative, because they label and identify, and to think of the other scales as quantitative, because they indicate the extent to which someone possesses a quality or characteristic. The next sections explore these quantitative scales in more detail.
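A brief Python sketch can show why arithmetic on nominal codes is meaningless. The data and both coding schemes below are hypothetical. Recoding the same seven participants with a different pair of numbers leaves the group sizes untouched but changes the "average," which tells us the average never meant anything in the first place.

```python
from collections import Counter
from statistics import mean

codes_a = [1, 1, 2, 1, 2, 2, 1]   # hypothetical coding: 1 = female, 2 = male
recode = {1: 7, 2: 3}             # an equally valid, equally arbitrary coding
codes_b = [recode[code] for code in codes_a]

# Group sizes survive recoding: the labels change, the counts do not.
counts_a = sorted(Counter(codes_a).values())
counts_b = sorted(Counter(codes_b).values())
print("group sizes:", counts_a, counts_b)

# The "average code" depends entirely on which arbitrary numbers were chosen.
mean_a = mean(codes_a)
mean_b = mean(codes_b)
print("mean of codes:", round(mean_a, 2), "versus", round(mean_b, 2))
```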

Ordinal Scales

Researchers use ordinal scales to represent ranked orders of conceptual variables, such that higher numbers reflect increasing magnitude of the underlying variable. For example, beauty contestants, horses, and Olympic athletes are all ranked by the order in which they finish: first, second, third, and so on. Likewise, movies, restaurants, and consumer goods are often rated using a system of stars (i.e., 1 star is poor; 5 stars is excellent) to represent their quality. In these examples, we can draw conclusions about the relative speed, beauty, or deliciousness of the rating target. Even so, the numbers used to label these rankings do not necessarily map directly to differences in the conceptual variable. The fourth-place finisher in a race is rarely twice as slow as the second-place finisher; the beauty-contest winner is not three times as attractive as the third-place finisher; and the boost in quality between a four-star and a five-star restaurant is not the same as the boost between a two-star and three-star restaurant. Ordinal scales represent rank orders, but the numbers do not have any absolute value of their own. This type of scale, then, is more powerful than a nominal scale but still limited in that it does not allow performance of mathematical operations. For example, if an Olympic athlete finished first in the 800-meter dash, third in the 400-meter hurdles, and second in the 400-meter relay, we might be tempted to calculate her average finish as second place. Unfortunately, the properties of ordinal scales prevent us from doing this sort of calculation, because the conceptual distance between first, second, and third place would be different in each case. (That is, the runner might have won the 800-meter dash by 5 seconds, but lost the 400-meter relay by less than a second.) To perform any mathematical manipulation of variables requires one of the next two types of scale.

Photo: An ordinal scale can place these three women in first, second, and third, but it cannot tell you how far apart they finished in their race.
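The limitation of rank orders can be illustrated with a short Python sketch. The runner labels and finishing times are hypothetical. Two races with very different margins produce exactly the same ranks, so the ranks alone cannot tell us how far apart the runners finished.

```python
def finishing_ranks(times_in_seconds):
    """Map each runner to a finishing place (1 = fastest) for one race."""
    ordered = sorted(times_in_seconds, key=times_in_seconds.get)
    return {runner: place for place, runner in enumerate(ordered, start=1)}

# Hypothetical finishing times in seconds for two different races.
close_race = {"A": 120.0, "B": 120.4, "C": 120.9}
blowout = {"A": 120.0, "B": 131.0, "C": 150.0}

# Both races yield identical rank orders...
print(finishing_ranks(close_race))
print(finishing_ranks(blowout))

# ...even though the gap between first and second place is very different.
gap_close = close_race["B"] - close_race["A"]
gap_blowout = blowout["B"] - blowout["A"]
print(round(gap_close, 1), "seconds versus", round(gap_blowout, 1), "seconds")
```

Averaging the ranks across races would treat these two situations as interchangeable, which is why the text cautions against computing an average finish.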

Interval Scales

Interval scales represent cases where the numbers on a measured variable correspond to equal distances on a conceptual variable. For example, temperature increases on the Fahrenheit scale represent equal intervals: warming from 40 to 47 degrees is the same increase as warming from 90 to 97 degrees. Interval scales share the key feature of ordinal scales (higher numbers indicate higher relative levels of the variable), but interval scales go an important step further. Because these numbers represent equal intervals, we are able to add, subtract, and compute averages. That is, whereas we could not calculate the athlete’s average finish, we can calculate the average temperature in San Francisco or the average age of participants.

Ratio Scales

Ratio scales go one final step further, representing interval scales that also have a true zero point, that is, the potential for a complete absence of the conceptual variable. Physical measurements, such as length, weight, and time, represent ratio scales, because it is possible to have a complete absence of any of these. Most behavioral measures also represent ratio scales, as it is possible to have zero drinks per day, zero presses of a reward button, or zero symptoms of the flu. Temperature in kelvins is measured on a ratio scale because zero on the Kelvin scale (absolute zero) indicates an absence of thermal molecular motion. (In contrast, 0 degrees Fahrenheit is merely an arbitrary point on the temperature scale.) Contrast these measurements with many of the conceptual variables featured in psychology research: no such things as zero attitude toward gun control or zero self-esteem exist. The big advantage of having a true zero point is that it allows us to add, subtract, multiply, and divide scale values. When we measure weight, for example, it makes sense to say that a 300-pound adult weighs twice as much as a 150-pound adult. Likewise, it makes sense to say that having two drinks per day is only one-fourth as many as having eight drinks per day.
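One way to check whether a statement about a scale is meaningful is to see whether it survives a change of units. The Python sketch below applies that test to the temperature and weight examples. The specific temperatures are chosen for illustration, and the helper function names are hypothetical. Equal differences in Fahrenheit remain equal in Celsius, but "twice as warm" does not survive the conversion, whereas "twice as heavy" holds in pounds and kilograms alike.

```python
import math

def fahrenheit_to_celsius(degrees_f):
    """Convert a Fahrenheit temperature to Celsius."""
    return (degrees_f - 32) * 5 / 9

def pounds_to_kilograms(pounds):
    """Convert pounds to kilograms (1 lb is defined as 0.45359237 kg)."""
    return pounds * 0.45359237

# Interval scale: equal differences stay equal after a change of units...
diff_low = fahrenheit_to_celsius(47) - fahrenheit_to_celsius(40)
diff_high = fahrenheit_to_celsius(97) - fahrenheit_to_celsius(90)
print("equal intervals preserved:", math.isclose(diff_low, diff_high))

# ...but ratios do not, because zero degrees is not a true zero point.
ratio_f = 80 / 40
ratio_c = fahrenheit_to_celsius(80) / fahrenheit_to_celsius(40)
print("80 vs. 40 degrees F as a ratio:", ratio_f, "in F,", round(ratio_c, 2), "in C")

# Ratio scale: the weight ratio is the same in either unit.
ratio_lb = 300 / 150
ratio_kg = pounds_to_kilograms(300) / pounds_to_kilograms(150)
print("weight ratio preserved:", math.isclose(ratio_lb, ratio_kg))
```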

Choosing and Using Scales of Measurement

The take-home point from the discussion of these four scales of measurement is twofold. First, researchers should always use the most powerful and flexible scale possible for their conceptual variables. In many cases, no choice is possible; time is measured on a ratio scale and gender is measured on a nominal scale. But some cases permit researchers a bit more freedom in designing their study. For example, if someone were interested in correlating weight with happiness, the researcher could capture weight in a few different ways. One option would be to ask people their satisfaction with their current weight on a seven-point scale. However, the resulting data would be on an ordinal or interval scale (see discussion below), and the degree to which the researcher could manipulate the scale values would be limited. Another, more powerful option, would be to measure people’s weight on a bathroom scale, resulting in ratio-scale data. Whenever possible, it is preferable to incorporate physical or behavioral measures. But it is also preferable—actually, required—to represent data accurately. Most variables in the social and behavioral sciences do not have a true zero point and must therefore be measured on nominal, ordinal, or interval scales.

Second, researchers should always be aware of the limitations of their measurement scale. As discussed above, these scales lend themselves to different amounts of mathematical manipulation. It is not possible to calculate statistical averages with anything less than an interval scale and not possible to multiply or divide anything less than a ratio scale. What does this mean for researchers? If they have collected ordinal data, they are limited to discussing the rank ordering of the values (e.g., the critics liked Restaurant A better than Restaurant B). If they have collected nominal data, they are limited to describing the different groups (e.g., percentages of Catholics and Protestants).
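One way to keep these limitations in view is to tie the choice of summary statistic to the scale. The sketch below is a simplified illustration, not a rule from the text: the function name `summarize` is hypothetical, and the pairing of mode with nominal data and median with ordinal data is a common convention assumed here.

```python
from statistics import mean, median_low, mode


def summarize(scale: str, values: list):
    """Return a summary statistic suited to the measurement scale."""
    if scale == "nominal":
        return mode(values)        # most common category
    if scale == "ordinal":
        return median_low(values)  # a middle rank, without averaging
    if scale in ("interval", "ratio"):
        return mean(values)        # arithmetic average
    raise ValueError(f"unknown scale: {scale}")


print(summarize("nominal", ["Catholic", "Protestant", "Catholic"]))
print(summarize("ordinal", [1, 3, 2, 5, 4]))
print(summarize("ratio", [150, 300]))
```

Only the last call computes an average, which matches the point that averages require at least an interval scale.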

One prominent grey area for both of these points is the use of attitude scales in the social and behavioral sciences. If we were to ask people to rate their attitudes about the death penalty on a seven-point rating scale, would the scale be ordinal or interval? This consideration turns out to be a contentious issue in the field. From the conservative point of view, these attitude ratings constitute only ordinal scales. We know that a 7 indicates more endorsement than a 3 but cannot say that moving from a 3 to a 4 is equivalent to moving from a 6 to a 7 in people’s minds. From the more liberal point of view, these attitude ratings can be viewed as interval scales. A researcher’s perspective is often driven by practical concerns: treating these as equal intervals allows us to compute totals and averages for our variables. Chapter 4 will return to this issue in discussing the creation of questionnaire items. For now, a good guideline is to assume that these individual attitude questions represent ordinal scales by default.

Types of Measurement

Each of the four scales of measurement can be used across a wide variety of research designs. In this section, we shift gears slightly and discuss measurement at a more conceptual, less mathematical level. The types of dependent measures used in psychological research studies can be grouped into three broad categories: behavioral, physiological, and self-report.

Behavioral Measurement

As mentioned earlier, behavioral measures are those that involve direct and systematic recording of observable behaviors. If a research question involves the ways that married couples deal with conflict, the researcher could include a behavioral measure by observing the way participants interact during an argument. Do they cut one another off? Listen attentively? Express hostility? Behaviors can be measured and quantified in one of four primary ways, as the scenario of observing married couples during conflict situations illustrates:

Frequency measurements involve counting the number of times a behavior occurs. For example, researchers could count the number of times each member of the couple rolled his or her eyes as a measure of dismissive behavior.

Duration measurements involve measuring the length of time a behavior lasts. For example, researchers could quantify the length of time the couple spends discussing positive versus negative topics as a measure of emotional tone.

Intensity measurements involve measuring the strength or potency of a behavior. For example, researchers could quantify the intensity of anger or happiness in each minute of the conflict using ratings by trained judges.

Latency measures involve measuring the delay before onset of a behavior. For example, researchers could measure the time between one person’s provocative statement and the other person’s response.
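The four ways of quantifying behavior can be sketched with a small coded observation log. Everything in this Python example is hypothetical: the `Event` record, the behavior labels, the timestamps, and the 1 to 5 judge ratings are invented for illustration and do not come from any actual study.

```python
from dataclasses import dataclass


@dataclass
class Event:
    behavior: str       # hypothetical coding label
    start: float        # seconds from the start of the session
    end: float
    intensity: int = 0  # hypothetical judge rating, 1 (mild) to 5 (strong); 0 = not rated


# A made-up log for one couple's conflict discussion.
log = [
    Event("provocation", 10.0, 12.0),
    Event("response", 15.5, 20.0, intensity=4),
    Event("eye_roll", 21.0, 21.5),
    Event("negative_talk", 30.0, 75.0, intensity=3),
    Event("eye_roll", 80.0, 80.5),
]

# Frequency: how many times did the behavior occur?
frequency = sum(1 for e in log if e.behavior == "eye_roll")

# Duration: how long did the behavior last in total?
duration = sum(e.end - e.start for e in log if e.behavior == "negative_talk")

# Intensity: average judge rating across the rated events.
rated = [e.intensity for e in log if e.intensity > 0]
intensity = sum(rated) / len(rated)

# Latency: delay between the end of a provocation and the start of the response.
provocation = next(e for e in log if e.behavior == "provocation")
response = next(
    e for e in log if e.behavior == "response" and e.start >= provocation.end
)
latency = response.start - provocation.end

print(frequency, duration, intensity, latency)
```

Each of the four numbers comes from the same log, which illustrates that a single observation session can yield several distinct behavioral measures.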

John Gottman, a psychologist at the University of Washington, has been conducting research along these lines for several decades, observing body language and interaction styles among married couples as they discuss an unresolved issue in their relationship (read more about this research and its implications for therapy on Dr. Gottman’s website, http://www.gottman.com/). What all of these behavioral measures provide is an unobtrusive way to measure the health of a relationship. That is, the major strength of behavioral responses is that they are typically more honest and unfiltered than responses to questionnaires. As Chapter 4 will discuss, people are sometimes dishonest on questionnaires to convey a more positive (or less negative) impression.

Behavioral responses offer a particular benefit for researchers interested in unpopular attitudes, such as prejudice and discrimination. If we were to ask people the extent to which they dislike members of other ethnic groups, they might not admit to these prejudices. Alternatively, a researcher could adopt the approach used by Yale psychologist Jack Dovidio and colleagues and measure how close people sat to people of different ethnic and racial groups, using this distance as a subtle and effective behavioral measure of prejudice (see http://www.yale.edu/intergroup/ for more information). But the primary downside to using behavioral measures may be evident: We end up having to infer the reasons that people behave as they do. Suppose that in one of these experiments, European-American participants, on average, sit farther away from African-Americans than from other European-Americans. This could, and often does, indicate prejudice; however, for the sake of argument, the farthest seat from the minority group member might also be the comfortable recliner with great lighting next to the window. To understand the reasons for behaviors, researchers have to supplement the behavioral measures with either physiological or self-report measurements.

Physiological Measurement

Physiological measures are those that involve quantifying bodily processes, including heart rate, brain activity, and facial muscle movements. If we were interested in the experience of test anxiety, we could measure heart rates as people complete a difficult math test. If we wanted to study emotional reactions to political speeches, we could measure heart rate, facial muscles, and brain activity as people view video clips. The big advantage of these measures is that they are the least subjective and the hardest for participants to control. It is incredibly difficult for people to control their heart rate or brain activity consciously, making these a great tool for assessing emotional reactions. However, as with behavioral measures, we also need some way to contextualize physiological data.

The best example of this shortcoming is the use of the polygraph, or lie detector, to detect deception. The lie-detector test involves connecting a variety of sensors to the body to measure heart rate, blood pressure, breathing rate, and sweating. All of these are physiological markers of the body’s fight-or-flight stress response, and the test’s goal is to measure whether someone shows signs of stress while being questioned. But here is the problem: Being falsely accused is also stressful. A trained polygraph examiner must place all of the accused’s physiological responses in the proper context. Is the individual stressed throughout the exam or only stressed when asked whether he pilfered money from the cash box? Is the person stressed when asked about her relationship with her spouse because she killed him or because she was having an affair? The examiner has to be extremely careful to avoid false accusations based on misinterpretations of physiological responses. (For a recent commentary on the use of the polygraph in the courtroom, see http://www.thedailybeast.com/articles/2015/02/04/the-polygraph-has-been-lying-for-90-years.html.) The same cautions apply to using these measures in psychological research: Does heart rate increase because participants are stressed by a political message, or because the experiment is taking too long, and they are late to another appointment? The researcher should always include additional measures in the study to help sort out the reasons behind physiological change.

[Photo: Digital Vision/Photodisc/Thinkstock. A self-report measure might be used to determine how likely voters are to support a candidate.]

Self-Report Measurement

Self-report measures are those that involve asking people to report on their own thoughts, feelings, and behaviors. If we were interested in the relationship between income and happiness, we could simply ask people to report their income and their level of happiness. If we wanted to know whether people were satisfied in their romantic relationships, we could simply ask them to rate their degree of satisfaction. The major advantage of these measures is that they provide access to internal processes. That is, if we want insight into why people voted for their favorite political candidate, the only option is to ask them. However, as the text has suggested already, people may not necessarily be honest and forthright in their answers, especially when dealing with politically incorrect or unpopular attitudes. Chapter 4 will return to this tension and discuss ways to increase the likelihood of honest self-reported answers.

Two broad categories of self-report measures can be used. One of the most common approaches is to ask for people’s responses using a fixed-format scale, which asks them to indicate their opinion on a preexisting scale. For example, a researcher might ask people, “How likely are you to vote for the Republican candidate for president?” on a scale from 1 (not likely) to 7 (very likely). The other broad approach is to obtain responses using a free-response format, which asks people to express their opinion in an open-ended format. For example, researchers might ask people to explain, “What are the factors you consider in choosing a political candidate?” The trade-off between these two categories is essentially a choice between data that is easy to code and analyze and data that is rich and complex. In general, fixed-format scales are used more in quantitative research, while free-response formats are used more in qualitative research. Chapter 4 will discuss these categories further in a discussion of survey research.

Research: Thinking Critically

Neuroscience and Addictive Behaviors

Follow the link below to read an article by journalist Christian Nordqvist. In this article, Nordqvist reviews recent research suggesting that food addiction might involve brain mechanisms similar to those involved in drug addiction. As you read the article, consider what you have learned so far about the research process, and then respond to the questions below.

http://www.medicalnewstoday.com/articles/221233.php

Think About It:

1. Is the study described here descriptive, correlational, or experimental? Explain.

2. Can we conclude from this study that food addiction causes brain abnormalities? Why or why not?

3. The authors of the study concluded: “The current study also provides evidence that objectively measured biological differences are related to variations in YFAS (Yale Food Addiction Scale) scores, thus providing further support for the validity of the scale.” What type(s) of validity are they referring to? Explain.

4. What types of measures are included in this study (e.g., behavioral, self-report)? What are the strengths and limitations of these measures in this study?

Converging Operations: The Best of All Worlds

As these descriptions show, each type of measurement has its strengths and flaws. So, how do researchers decide which one to use? This question has to be answered for every case, and the answer involves consideration of three factors: fit with the research question; insights from previous research; and practical considerations like budget and equipment availability. However, in an ideal world, a program of research will use a wide variety of measures and designs. The term for this approach is converging operations, or the use of multiple research methods to solve a single problem. In essence, over the course of several studies (perhaps spanning several years) a researcher would address a research question using different designs, different measures, and different levels of analysis.

One good example of converging operations comes from the research of psychologist James Gross and his colleagues at Stanford University. Gross and his team study the ways that people regulate their emotional responses and have conducted this work using everything from questionnaires to brain scans (see http://spl.stanford.edu/projects.html).

One branch of Gross’s research has examined the consequences of trying to either suppress emotions (pretend they are not happening) or reappraise them (think of them in a different light). Gross’s team studies suppression by asking people to hold in their emotional reactions while watching a graphic medical video. The researchers study reappraisal by asking people to watch the same video while trying to view it as a medical student, thus changing the meaning of what they see. When people try to suppress their emotional responses, they experience an ironic increase in physiological and self-reported emotional responses, as well as deficits in cognitive and social functioning. When reappraising emotions, on the other hand, people experience lower levels of both reported and physiological emotion, without any loss of other functioning. In another branch of the research, Gross and colleagues have examined the neural processes at work when people change their perspective about an emotional event. In yet another branch of the research, they have used self-report measures to examine individual differences in emotional responses, with the goal of understanding why some people are more capable of managing their emotions than others. Taken together, these studies all converge into a more comprehensive picture of the process of emotion regulation than would be possible from any single study or method.

2.4 Hypothesis Testing

Regardless of the details of a particular study, be it correlational, experimental, or descriptive, all quantitative research follows the same process of testing a hypothesis. This section provides an overview of this process, including a discussion of the statistical logic, the five steps of the process, and the two ways we can make mistakes during our hypothesis test. Some of this material may be a review from statistics class, but it forms the basis of our scientific decision-making process and thus warrants repeating.

The Logic of Hypothesis Testing

Chapter 1 discussed several criteria for identifying a “good” theory, one of which is that theories have to be falsifiable. In other words, research questions should have the ability to be proven wrong under the right set of conditions. Why is this so important? This will sound counterintuitive at first, but by the standards of logic, when data run counter to a researcher’s theory, that is more meaningful than when data support the theory.

For example, suppose we hypothesize that growing up in a low-income family puts children at higher risk for depression. If the data fit this pattern, our prediction might very well be correct. It is also possible, however, that these results are due to a third variable: perhaps children in low-income families grow up in more stressful neighborhoods, and stress turns out to increase a person’s depression risk. Or, perhaps our sample accidentally contained an abnormal number of depressed people. This is why we are always cautious in interpreting positive results from a single study. Yet now, imagine that we test the same hypothesis and find that those who grew up in low-income families show a lower rate of depression. This is still a single study, but it suggests that our hypothesis may have been off-base.

Another way to think about this is from a statistical perspective. As the chapter discussed earlier, all measurements contain some amount of random error, which means that any pattern of data could be caused by random chance. This is the primary reason that research is never able to “prove” a theory. We will learn (or recall) from the study of statistics that at the end of any hypothesis test, we calculate a p value, representing the probability of observing our results, or results that are even more extreme, if random chance alone were at work. Conceptually, we are gauging how easily chance could have misled us rather than calculating the probability that we are right in our predictions. And the bigger the effect, the smaller this probability will generally be. So, as strange as it seems, the ideal result of hypothesis testing is a small probability that chance alone would produce results like ours.

This focus on falsifiability carries over to the way we test our hypotheses, in that the goal is to reject the possibility of results being due to chance. The starting point of a hypothesis test is to state a null hypothesis, or the assumption that the variables have no real effect in the overall population. This is another way of saying that observed patterns of data are due to random chance. In essence, we propose this null in hopes of minimizing the odds that it is true. Then, as a counterpoint to the null hypothesis, we propose an alternative hypothesis that represents the predicted pattern of results. This part is a little confusing, because the word alternative actually refers to the hypothesis in which we are interested. The term is employed because, in statistical jargon, the alternative hypothesis represents the predicted deviation from the null. These alternative hypotheses can be directional, meaning that we specify the direction of the effect, or nondirectional, meaning that we simply predict an effect.

Say we want to test the hypothesis that people like cats better than dogs. We would start with the null hypothesis, that people like cats and dogs the same amount (i.e., no difference). The next step is to state the alternative hypothesis (that is, our actual hypothesis), which in this case is that people will prefer cats. Because we are predicting a direction (cats more than dogs), this hypothesis is directional. The other option would be a nondirectional hypothesis, or simply stating that people’s cat preferences differ from their dog preferences. (Note that we have avoided predicting which one people like better, which is what makes it nondirectional.)

Finally, these three hypotheses can also be expressed using logical notation, as shown below. The letter H is used as an abbreviation for “hypothesis,” and the Greek letter µ is a common abbreviation for the mean, or average.

Conceptual Hypothesis: People like cats better than dogs.

Null Hypothesis: H0: µcat = µdog (the “cat” mean is equal to the “dog” mean; people like cats and dogs the same)

Nondirectional Alternative Hypothesis: H1: µcat ≠ µdog (the “cat” mean is not equal to the “dog” mean; people like cats and dogs different amounts)

Directional Alternative Hypothesis: H1: µcat > µdog (the “cat” mean is greater than the “dog” mean; people like cats more than dogs)

Why distinguish between directional and nondirectional hypotheses? A statistics class provides a more detailed answer, but it is important to note that this decision will have implications for the level of statistical significance. In essence, nondirectional hypotheses are less precise: “I think there is a difference,” versus “I believe cats are the preferred pet!” Because we always want to minimize the risk of coming to the wrong conclusion, we have to be more conservative with a nondirectional test. In this context, being conservative means needing a bigger group difference to feel confident in the results.

In the cats-versus-dogs example, a larger difference in ratings would be needed to support the claim that people like cats and dogs different amounts than would be needed to support the claim that people like cats more than dogs. The goal of all this statistical and logical jargon is to place hypothesis testing in the proper frame. The most important thing to remember is that hypothesis testing is designed to reject the null hypothesis, and statistical tests tell us how confident to be in this rejection.
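The difference between the two kinds of test can be illustrated numerically. The Python sketch below assumes, purely for illustration, that the comparison has been reduced to a z statistic (the text does not name a specific test) and that the predicted direction corresponds to a positive z. The value 1.8 is a made-up example, and the 0.05 cutoff is a conventional threshold.

```python
from statistics import NormalDist


def p_values(z: float) -> tuple[float, float]:
    """Return (directional, nondirectional) p values for a z statistic.

    The directional value assumes the predicted effect is in the positive direction.
    """
    std_normal = NormalDist()  # mean 0, standard deviation 1
    one_tailed = 1.0 - std_normal.cdf(z)
    two_tailed = 2.0 * (1.0 - std_normal.cdf(abs(z)))
    return one_tailed, two_tailed


one, two = p_values(1.8)  # hypothetical test statistic
print(f"directional p = {one:.3f}, nondirectional p = {two:.3f}")
print("directional test significant at .05:", one < 0.05)
print("nondirectional test significant at .05:", two < 0.05)
```

For the same data, the nondirectional p value is twice the directional one, so a group difference that clears the cutoff under a directional hypothesis can fall short under a nondirectional one.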

Five Steps to Hypothesis Testing

Now that we understand how to frame a hypothesis, what does a researcher do with this information? Framing a hypothesis is the first step of a five-step process of testing a hypothesis. This section walks through an example of hypothesis testing from start to finish, that is, from an initial hypothesis to a conclusion about the hypothesis. Using a fictitious study, we will test the prediction that married couples without children are happier than those with children in the home. This example is inspired by an actual study by Harvard social psychologist Dan Gilbert and his colleagues, described in a news article at http://www.telegraph.co.uk/news/1941195/Marriage-without-children-the-key-to-bliss.html. The hypothesis may seem counterintuitive, but Gilbert’s research suggests that people tend to both overestimate the extent to which children will make them happy and underestimate the added stress and financial demands of having children in the house.

Step 1: State the Hypothesis

The first step in testing this hypothesis is to spell it out in logical terms. Remember that we want to start with the null hypothesis that the presence of children in a home has no effect. So, in this case, the null hypothesis would be that couples are equally happy with and without children. Or, in logical notation, H0: µchildren = µno children (i.e., the mean happiness rating for couples with children equals the mean happiness rating for couples without children). From there, we can spell out our alternative hypothesis; in this case, we predict that having children will make couples less happy. Because this is a directional hypothesis, we write H1: µchildren < µno children (i.e., the mean happiness rating for couples with children is lower than the mean happiness rating for couples without children).

Step 2: Define Variables

Once we have an idea of the conceptual relationship that we want to test, we need to translate these concepts into measurable variables. As the chapter has discussed more than once, the decisions we make at this stage will trickle down and influence every subsequent step of the research process. For our current example, we will need to find a way to define the concept of “happiness,” as well as decide our criteria for “couples with / without children.” We have encountered happiness as an example before, so it seems fairly straightforward to define it based on participants’ responses to a happiness scale. But what does it mean for a couple to have children? Do the children need to be of a certain age, or would the study include everyone from parents of newborns to empty-nesters whose children are away at college? These types of decisions need to be made carefully, to ensure that we are controlling outside influences that might interfere with our hypothesis test. For example, couples who survive the trials and tribulations of raising a toddler without getting divorced may come to develop a more realistic set of expectations for their everyday happiness, compared to the parents of newborns or the parents of college students.

Step 3: Collect Data

The next step is to design and conduct a study that will test our hypothesis. The next three chapters will elaborate on this process in great detail, but the general idea is the same regardless of the design. In this case, the most appropriate design would be correlational because we want to predict happiness based on whether people have children. It would be impractical and unethical to randomly assign people to have children, so an experimental design is not possible in this case. One way to conduct our study would be to survey married couples about whether they had children and ask them to rate their current level of happiness with the marriage. Suppose we conduct this study and end up with the data in Figure 2.4.

Figure 2.4: Sample data for the “children and happiness” study

As the figure shows, the results suggest an average happiness rating of 5.7 for couples without children, compared to an average happiness rating of 2.0 for couples with children. These groups certainly look different—and encouraging for our hypothesis—but we need to be sure that the difference is big enough that we can reject the null hypothesis.

Step 4: Calculate Statistics

The next step in our hypothesis test is to calculate statistical tests to decide how confident we can be that the results are meaningful. Researchers have a wide variety of statistical tools at their disposal and different ways to analyze all manner of data. These tools can be broadly grouped into descriptive statistics, which describe the patterns and distribution of measured variables, and inferential statistics, which attempt to draw inferences about the population from which the sample was drawn. Researchers use inferential statistics to make decisions about the significance of the data. Statistics courses cover many of these in detail, and we will discuss a few examples throughout this book. All of these different techniques share a common principle: They attempt to make inferences by comparing the relationship among variables to the random variability of the data. As the chapter discussed earlier, people’s measured levels of everything from happiness to heart rate can be influenced by a wide range of variables. The hope in testing our hypotheses is that differences in our measurements will primarily reflect differences in the variables we are studying. In the current example, we would want to see that differences in happiness ratings of the married couples were influenced more by the presence of children than by random fluctuations in happiness. Regardless of which statistic a researcher chooses to test the hypothesis, the resulting value will be translated into a measure of statistical significance, and this provides a key piece of information for the final decision.
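The principle of comparing a group difference to random variability can be sketched with an exact permutation test. This is one possible technique chosen for illustration, not necessarily the test the text has in mind. The ratings below are made up and are not the Figure 2.4 data, and the function name `permutation_p` is hypothetical. The idea is to ask: if group labels were arbitrary, how often would reshuffling them produce a difference at least as large as the one observed?

```python
from itertools import combinations
from statistics import mean

# Made-up happiness ratings (1 to 7 scale), for illustration only.
no_children = [6, 5, 6, 7, 5]
children = [2, 3, 1, 2, 2]


def permutation_p(group_a, group_b):
    """One-sided exact permutation p value for mean(group_a) > mean(group_b)."""
    pooled = group_a + group_b
    n_a = len(group_a)
    observed = mean(group_a) - mean(group_b)
    extreme = 0
    total = 0
    # Enumerate every way of relabeling n_a of the pooled scores as "group A".
    for idx in combinations(range(len(pooled)), n_a):
        chosen = set(idx)
        a = [pooled[i] for i in chosen]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Small tolerance guards against floating-point rounding.
        if mean(a) - mean(b) >= observed - 1e-12:
            extreme += 1
    return observed, extreme / total


observed, p = permutation_p(no_children, children)
print(f"observed difference = {observed:.1f}, p = {p:.4f}")
```

With these invented ratings, only 1 of the 252 possible relabelings produces a difference as large as the observed one, so the p value is 1/252, or about 0.004. A real analysis would typically rely on statistical software and a test suited to the design.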

Step 5: Make a Decision

Finally, we are able to draw a conclusion about our study. Based on the outcome of our statistical test (i.e., step 4), we will make one of two decisions about our null hypothesis:

Reject null: decide that results like ours would be sufficiently unlikely if the null were correct; that is, results are due to differences in groups

Fail to reject null: decide that results like ours would be too likely if the null were correct; that is, results may be due to chance

Given the mean difference in Figure 2.4, and the small amount of error, our statistical test would certainly be significant, and we could be confident in rejecting the null hypothesis. At long last, we can express our findings in plain English: Couples with children are less happy than couples without children.

Having walked through this five-step process, we note an important fact. When it comes to analyzing data to test hypotheses, researchers actually rely on a computer program for part of this process, Step 4 in particular. In these modern times, computing even a simple means comparison by hand is rare. Software programs such as SPSS, SAS, and Microsoft Excel can take a table of data, compute the mean difference, compare it to the variability, and calculate the probability that the results are due to chance. However, because these calculations happen behind the scenes, it is very important to understand the process. By understanding how the software operates, researchers can reach informed conclusions about their research questions. Otherwise, they risk making one of two possible errors in the hypothesis test, discussed in the next section.

Errors in Hypothesis Testing

In the children and happiness study, we concluded with a reasonable amount of confidence that our hypothesis was supported. Still, what if we made the wrong decision? Because our conclusions are based on interpreting probability, there is always a chance that we draw the wrong conclusion. In interpreting our hypothesis tests, we risk two potential errors, referred to as Type I and Type II errors.

Type I errors occur when the results are due to chance, but the researcher mistakenly concludes that the effect is significant. In other words, no effect of the variables exists in the population, but some quirk of the sample makes the effect appear significant. This error can be viewed as a false positive—researchers get excited over results that are not actually meaningful. In our children and happiness study, a Type I error would occur if children had no effect on happiness in the real world, but some quirk of chance made our “no children” group happier than the “children” group. For example, our sample of childless couples might accidentally contain a greater proportion of people with happy personalities or greater job stability or simply more marital satisfaction from the start.

Fortunately—although this error seems worrisome—we can generally compute the probability of making it. Our alpha level sets the bar for how extreme our data must be to reject the null hypothesis. At the end of the statistical calculation, a p value tells us how extreme the data actually are. When we set an alpha threshold of, say, 0.05, we are attempting to avoid a Type I error; our results will only be statistically significant if the effect outweighs the random variability by a big-enough amount. If the p value falls below our predetermined alpha level, we decide that the risk of a Type I error is sufficiently small and can therefore reject the null hypothesis. If, however, the p value is greater than (or even equal to) our alpha cutoff, we decide that the risk of Type I error is too high to ignore and will therefore fail to reject the null hypothesis.
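The link between the alpha level and the Type I error rate can be illustrated with a small simulation. The sketch below is a simplified assumption-laden example: it draws both groups from the same population (so the null hypothesis is true by construction), uses a z test with a known standard deviation of 1, and picks the group size, number of trials, and random seed arbitrarily. The function name `false_positive_rate` is hypothetical.

```python
import random
from statistics import NormalDist


def false_positive_rate(alpha=0.05, n=30, trials=2000, seed=1):
    """Share of simulated null-true studies that reject the null anyway."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    rejections = 0
    for _ in range(trials):
        # Both groups come from the same population: no real effect exists.
        a = [rng.gauss(0, 1) for _ in range(n)]
        b = [rng.gauss(0, 1) for _ in range(n)]
        # z statistic for a difference in means with known sd = 1.
        z = (sum(a) / n - sum(b) / n) / (2 / n) ** 0.5
        p = 2 * (1 - std_normal.cdf(abs(z)))  # nondirectional p value
        if p < alpha:
            rejections += 1  # a Type I error: significant by chance alone
    return rejections / trials


rate = false_positive_rate()
print(f"Type I error rate with alpha = .05: {rate:.3f}")
```

Across many such simulated studies, the proportion of false positives should land close to the chosen alpha, which is the sense in which alpha sets the risk of a Type I error.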

Type II errors occur when a real effect exists, but the researcher mistakenly concludes that the results are due to chance. In other words, an effect of the variables does exist in the population, but some quirk of the sample makes the effect appear nonsignificant. This error can be viewed as a false negative: researchers miss results that actually could have been meaningful. In our children and happiness study, a Type II error would occur if couples without children really were happier than couples with children but some flaw in the study kept us from detecting the difference. For example, if our measures of happiness were poorly designed, people might vary in how they interpreted the items, and this source of error could make it difficult to spot an overall difference between the groups.

Although this error sounds disappointing, the good news is researchers have some fairly easy ways to avoid or minimize it.

The key factor in reducing Type II error is to maximize the power of the statistical test, or the probability of detecting a real difference. In fact, power is inversely related to the probability of a Type II error—the higher the power, the lower the chance of Type II error. Power is analogous to the sensitivity, or accuracy, of the hypothesis test; it is under the researcher’s control in three main ways. First, as the section Reliability and Validity discussed, it is important to make sure that measures are capturing what the researcher thinks they are. If the happiness scale actually captures something like narcissism, then this will cause problems for the hypothesis about the predictors of happiness. Second, it is important to be careful throughout the process of coding and analyzing data. Small mistakes can occur at every step, from entering data, to calculating scale totals, to choosing an inappropriate analysis. And third, statistical tests generally have more power when the sample is larger. We will discuss each of these factors in more detail as we move through the course.
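To make the link between sample size, power, and Type II error concrete, here is a minimal sketch. It assumes a two-group comparison of means, a two-sided test, and a normal approximation; the function name and the hypothetical true difference of half a standard deviation are illustrative choices, not values from the text.

```python
from statistics import NormalDist


def approximate_power(standardized_difference, n_per_group, alpha=0.05):
    """Approximate power for a two-sided comparison of two group means.

    Uses the normal approximation and ignores the very small chance of
    rejecting the null in the wrong direction.
    """
    z_critical = NormalDist().inv_cdf(1 - alpha / 2)
    signal = standardized_difference * (n_per_group / 2) ** 0.5
    return NormalDist().cdf(signal - z_critical)


# Hypothetical true difference of half a standard deviation between groups.
for n in (10, 30, 64, 200):
    power = approximate_power(0.5, n)
    print(f"n per group = {n:3d}  power = {power:.2f}  Type II risk = {1 - power:.2f}")
```

Each row shows power rising and the Type II risk (one minus power) falling as the groups get larger, which is the third of the three factors described above.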

Research: Thinking Critically

The Truth About Cats and Dogs

Follow the link below to a press release on the website of the American Psychological Association. This press release describes a compelling research finding, from the social psychologist Allen McConnell, that examines the benefits of pet ownership for people’s mental health. As you read the article, consider what you have learned so far about the research process, and then respond to the questions below.

http://www.apa.org/news/press/releases/2011/07/cats-dogs.aspx

Think About It:

1. In the first study described, 217 people answered surveys about well-being, and the researchers compared responses of pet owners to those of nonowners.

a. Is this study descriptive, correlational, or experimental?
b. Can we infer a causal relationship from this study? Explain.
c. Is there a possible directionality problem or third variable problem? Explain.

2. In the third study, what is the independent variable? What is the dependent variable?

3. What are the null hypotheses being tested in each of these studies? What are the alternate hypotheses?

4. What would a Type I decision error be in these studies? A Type II decision error?

Summary of Correct and Incorrect Decisions

In the real world, at the level of the entire population, our null hypothesis is either true or false. That is, if we could test our hypothesis by surveying every married couple in the world, we could say with 100% certainty whether or not the hypothesis was true. However, in each individual study, at the level of our sample, we have to decide either to reject the null or fail to reject it. Table 2.3 summarizes the four possible outcomes of a decision about a hypothesis test. In the top left and bottom right cells, we make the right decision—either rejecting a null hypothesis that is false or failing to reject one that is true in the population. In the bottom left cell of the table, we make a Type I error, rejecting a null hypothesis that is actually true, and mistakenly thinking our hypothesis is supported (i.e., a false positive). In the top right cell of the table, we make a Type II error, failing to reject a null hypothesis that is actually false, and mistakenly thinking our hypothesis should be rejected (i.e., a false negative).

Effect size can be used to help determine the effectiveness of a particular drug.

Table 2.3: Errors and correct decisions in hypothesis testing

                   Researcher’s Decision
                   Reject Null          Fail to Reject Null
Null is FALSE      Correct Decision     Type II Error
Null is TRUE       Type I Error         Correct Decision
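The four cells of Table 2.3 can also be written as a small lookup. This is only an illustrative sketch; the function name is hypothetical, and in a real study the truth of the null hypothesis is never known directly.

```python
def classify_outcome(null_is_true, rejected_null):
    """Return the Table 2.3 cell for one combination of truth and decision."""
    if rejected_null:
        return "Type I error" if null_is_true else "correct decision"
    return "correct decision" if null_is_true else "Type II error"


# Walk through the table row by row: null FALSE first, then null TRUE.
for null_is_true in (False, True):
    for rejected_null in (True, False):
        outcome = classify_outcome(null_is_true, rejected_null)
        print(f"null true: {null_is_true!s:5}  rejected: {rejected_null!s:5}  -> {outcome}")
```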

Chapter 1 (section 1.3) explained the process of drawing conclusions about “proof” and “disproof,” suggesting that neither one is ever possible in a single study. Now that we have covered the hypothesis-testing process, the reasoning behind rules regarding proof and disproof should be clearer. In fact, Type I and Type II errors are possible in every research study. Rejecting the null hypothesis in one study does not automatically mean that it is false, only that the null hypothesis could not explain the pattern of data in the study. Moreover, failing to reject the null in one study does not automatically mean that it is true, only that the pattern of data in the study does not support rejecting it. Science accumulates knowledge over the course of several related studies. It is only when these studies start to suggest the same conclusion that we can feel more confident in our decisions about the status of the null hypothesis.

Effect Size

So far, our discussion about hypothesis testing has been focused on statistical significance, and we have been concerned with the probability that our results might be due to random chance. However, keep in mind an additional piece of the puzzle of interpreting results. Imagine that someone has been placed in charge of testing a new drug that might help cure depression. The researcher might start by collecting a large sample of depressed patients and giving half of them the new drug and half of them a placebo. Now imagine that the new drug reduced symptoms by 20%, compared to a 10% reduction with the placebo. Is this effect big enough to become excited? If the new drug costs twice as much as existing ones, is it worth recommending? These questions revolve around the issue of effect size, a statistic used to represent the size, or magnitude, of an effect.

Effect size may be calculated in several ways, but as a general rule, bigger values mean a stronger effect. One of these statistics, Cohen’s d, is calculated as the difference between two means divided by their pooled variability. In this case, our variability measure is something called the standard deviation, which represents the average deviation of individual scores from the mean of the group. A larger standard deviation indicates that the scores are dispersed more widely around the mean. When we use this number in calculating Cohen’s d, the resulting values can therefore be expressed in terms of standard deviations; a d of 1 indicates that the means are one standard deviation apart. How big should we expect our effects to be? Based on his analyses of typical effect sizes in the social sciences, Cohen suggests the following benchmarks: d = 0.20 is a small effect; d = 0.50 is a moderate effect; and d = 0.80 is a large effect. In other words, even a “large” effect in the social and behavioral sciences amounts to less than one standard deviation. For comparison purposes, the effect of the polio vaccine on reducing polio symptoms was a d = 2.72 (almost three standard deviations; Oshinsky, 2006). Our children and happiness study produces a d = 3.82, but fake data are always more impressive than real data.
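The calculation described above, the difference between two means divided by the pooled standard deviation, can be sketched in a few lines. The function name and the tiny sets of ratings are hypothetical and chosen only so the arithmetic is easy to follow; they are not the data from the children and happiness study.

```python
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Difference between two group means divided by the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled_variance = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / pooled_variance ** 0.5


# Hypothetical happiness ratings for two small groups.
no_children = [2, 4, 6]  # mean 4, sample variance 4
children = [1, 3, 5]     # mean 3, sample variance 4

print(cohens_d(no_children, children))  # 0.5
```

Here the means differ by 1 point and the pooled standard deviation is 2, so the groups sit half a standard deviation apart.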

Effect size is useful in two primary ways. First, at the end of an experiment, we can calculate the exact size of the effect in our particular sample. This is a useful supplement to our test of statistical significance because it is less dependent on sample size. If we fail to reject the null hypothesis in a small sample, the effect size might tell us whether the effect is big enough to test again with a larger sample. And, if we support our research hypothesis, the effect size provides valuable information about the usefulness of our findings. Imagine testing two different diabetes drugs in two different studies. Say both show a statistically significant reduction in symptoms, but Drug A has an effect size of d = 0.50, and Drug B has an effect size of d = 2.5. This tells us that Drug B has a larger effect and could therefore offer diabetes patients a bigger benefit.

The second use for effect size is in deciding on our sample size before the study begins. We learned earlier that our statistical tests generally have more power in a larger sample size. So why not run 10,000 participants in every single research study? The problem is that participants take time, money, and other resources, and not every study needs 10,000 people to detect an effect. Rather than striving for perfect power in every study, researchers usually compromise and hope for 80% power, which equates to only a 20% chance of Type II error. It turns out that we also have more power when the underlying effect is larger. Thus, we can take our estimates of effect size and determine the number of people we need to achieve at least 80% power.

The best way to perform these calculations is by using any of the power calculators available over the Internet. Figure 2.5 presents an annotated example using the calculator available at http://www.stat.ubc.ca/~rollin/stats/ssize/n2.html. The values entered represent the means from our children and happiness study, plus the pooled standard deviation of 1.25. This calculation results in the previously mentioned d of 3.82. According to this calculator, we would only need two people per group to detect this effect in a future study—much cheaper and easier than 10,000.
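For readers who want to see roughly what such a calculator does, here is a minimal sketch. It assumes a two-sided comparison of two equally sized groups and a normal approximation; the function name is hypothetical, and the online calculator may use a somewhat different formula, so treat the numbers as approximate.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(d, power=0.80, alpha=0.05):
    """Approximate participants needed per group to detect an effect of size d."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # cutoff for a two-sided test
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)


# Two illustrative effect sizes, plus the d of 3.82 from our fake data.
for d in (0.2, 0.5, 3.82):
    print(f"d = {d:<4}  n per group = {n_per_group(d)}")
```

Under these assumptions, a d of 3.82 requires only two people per group, in line with the calculator result described above, while smaller effects require far larger samples. Calculators based on the t distribution typically return slightly larger numbers for small samples.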

Figure 2.5: Example of using effect size to estimate sample size

Copycat suicides often peak 3 days after media coverage of a high profile suicide, such as when Nirvana’s Kurt Cobain killed himself in 1994.

3.3 Archival Research

Slightly further along the continuum of control is archival research, which involves drawing conclusions by analyzing existing sources of data, including both public and private records. Sociologist David Phillips (1977) hypothesized that media coverage of suicides would lead to “copycat” suicides. He tested this hypothesis by gathering archival data from two sources: front-page newspaper articles devoted to high-profile suicides and the number of fatalities in the 11-day period following coverage of the suicide. By examining these patterns of data, Phillips found support for his hypothesis. Specifically, fatalities appeared to peak three days after coverage of a suicide, and a greater degree of publicity was associated with a greater peak in fatalities.

Pros and Cons of Archival Research

It is difficult to imagine a better way to test Phillips’s hypothesis about copycat suicides. A researcher could never randomly assign people to learn about suicides and then wait to see whether they killed themselves. Nor could someone interview people right before they commit suicide to determine whether they were inspired by media coverage. Archival research provides a test of the hypothesis by examining data that already exist and, thereby, avoids most of the ethical and practical problems of other research designs. One key element of archival research is that it neatly sidesteps issues of participant reactivity, or the tendency of people to behave differently when they are aware of being observed. Any time research is conducted in a laboratory, participants know they are part of a study and may not behave in a completely natural manner. In contrast, archival data involves making use of records of people’s natural behaviors. The subjects of Phillips’s study of copycat suicides were individuals who decided to kill themselves, who had no awareness that they would be part of a research study.

Archival research is also an excellent strategy for examining trends and changes over time. For example, much of the evidence for global warming comes from observing upward trends in recorded temperatures around the globe. To gather this evidence, researchers dig into existing archives of weather patterns and conduct statistical tests of the changes over time. Psychologists and other social scientists also make use of this approach to examine population-level changes in everything from suicide rates to voting patterns over time. These comparisons can sometimes involve a blend of archival and current data. For example, a great deal of social-psychology research has been dedicated to understanding people’s stereotypes about other groups. In a classic series of studies known as the “Princeton Trilogy,” researchers documented the stereotypes held by Princeton students over 36 years (1933 to 1969). Social psychologist Stephanie Madon and her colleagues (2001) collected a new round of data but also conducted a new analysis of the previous archival data. These new analyses suggested that, over time, people have become more willing to stereotype other groups, even as the stereotypes themselves have become less negative.

One final advantage of archival research is that once a researcher gains access to the relevant archives, it requires relatively few resources. The typical laboratory experiment involves one participant at a time, sometimes requiring the dedicated attention of more than one research assistant for an hour or more. After researchers assemble data from the archives, though, conducting statistical analyses is a relatively simple matter. In a 2001 article, the psychologists Shannon Stirman and James Pennebaker used a text-analysis computer program to compare the language of poets who committed suicide (e.g., Sylvia Plath) with the language of similar poets who had not committed suicide (e.g., Denise Levertov). In total, these researchers examined 300 poems from 20 poets, half of whom had committed suicide. Consistent with Durkheim’s theory of suicide as a form of “social disengagement,” Stirman and Pennebaker (2001) found that suicidal poets used more self-references and fewer references to other people in their poems. The impressive part of the study is this: Once they had assembled their archive of poems, their computer program took only seconds to analyze the language and generate a statistical profile of each poet.
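To give a feel for how a program can turn text into numbers, here is a minimal sketch of counting self-references and references to other people. It is not the program Stirman and Pennebaker used; the word lists, the function name, and the sample lines are hypothetical and far cruder than a real text-analysis dictionary.

```python
import re

# Hypothetical word lists, chosen only for illustration.
SELF_REFERENCES = {"i", "me", "my", "mine", "myself"}
OTHER_REFERENCES = {"we", "us", "our", "ours", "you", "your", "they", "them", "their"}


def reference_rates(text):
    """Return the share of words that refer to the self and to other people."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return {"self": 0.0, "other": 0.0}
    self_count = sum(word in SELF_REFERENCES for word in words)
    other_count = sum(word in OTHER_REFERENCES for word in words)
    return {"self": self_count / len(words), "other": other_count / len(words)}


# Made-up lines, not quotations from any poet.
print(reference_rates("I think my heart is mine"))  # half the words are self-references
print(reference_rates("We gave them our time"))     # three of five words refer to others
```

Running a function like this over every poem in an archive would yield one profile per poet, which could then be compared across the two groups.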

Overall, however, archival research is still relatively low on the continuum of control. Researchers have to accept the archival data in whatever form they exist, with no control over the way they were collected. For instance, in Stephanie Madon’s (2001) reanalysis of the “Princeton Trilogy” data, she had to trust that the original researchers had collected the data in a reasonable and unbiased way. In addition, because archival data often represent natural behavior, it can be difficult to categorize and organize responses in a meaningful and quantitative way. The upshot is that archival research often requires some creativity on the researcher’s part—such as analyzing poetry using a text-analysis program. In many cases, as we discuss next, the process of analyzing archives involves developing a coding strategy for extracting the most relevant information.

Content Analysis—Analyzing Archives

In most examples so far, the data come in a straightforward, ready-to-analyze form. That is, it is relatively simple to count the number of suicides, track the average temperature, or compare responses to questionnaires about stereotyping over time. In other cases, the data can come as a sloppy, disorganized mass of information. How does someone who wants to analyze literature, media images, or changes in race relations on television accomplish the analysis? These types of data can yield incredibly useful information, provided the researcher can develop a strategy for extracting it.

Mark Frank and Tom Gilovich—both psychologists at Cornell University—were interested in whether cultural associations with the color black affected behavior. In virtually all cultures, the term “black” is associated with evil—the bad guys wear black hats; people have a “black day” when things turn sour; and some are excluded from social groups by being “blacklisted” or “blackballed.” These associations appear to be independent of any culture-specific prejudices regarding race or skin color. Frank and Gilovich (1988) wondered whether “a cue as subtle as the color of a person’s clothing” (p. 74) would influence aggressive behavior. To test this hypothesis, they examined aggressive behaviors in professional football and hockey games, comparing teams whose uniforms were black to teams who wore other colors. Imagine for a moment being a researcher for this study. Professional sporting events contain a wealth of behaviors and events. How would information about the relationship between uniform color and aggressive behavior be extracted?

Frank and Gilovich (1988) solved this problem by examining public records of penalty yards (football) and penalty minutes (hockey) because these represent instances of punishment for excessively aggressive behavior, as recognized by the referees. In addition, in both sports, the size of the penalty increases according to the degree of aggression. These penalty records were obtained from the central offices of both leagues, covering the period from 1970 to 1986. Consistent with the researchers’ hypothesis, teams with black uniforms were “uncommonly aggressive” (p. 76). Most strikingly, two NHL hockey teams changed their uniforms to black during the period under study and showed a marked increase in penalty minutes with the new uniforms. One equally compelling alternative explanation is that, rather than the teams acting more aggressive in black uniforms, referees perceived them to be more aggressive while wearing black uniforms. Both explanations are consistent with the idea that cultural associations can affect behavior.

Even this analysis, however, is relatively straightforward because it involved data that were already in quantitative form (penalty yards and minutes). In many cases, the starting point is a jumbled mess of human behavior. In a pair of journal articles, psychologist Russell Weigel and colleagues (1980; 1995) examined the portrayal of race relations on prime-time television. To do so, they had to make several critical decisions about what to analyze and how to quantify it. The process of systematically extracting and analyzing the contents of a collection of information is known as content analysis. In essence, content analysis involves developing a plan to code and record specific behaviors and events in a consistent way. We can break this plan down into a three-step process.

Step 1—Identify Relevant Archives

Before we develop our coding scheme, we have to start by finding the most appropriate source of data. Sometimes the choice is fairly obvious: To compare temperature trends, the most relevant archives will be weather records. To track changes in stereotyping over time, the most relevant archive is questionnaire data assessing people’s attitudes. In other cases, this decision involves careful consideration of both the research question and practical concerns. Frank and Gilovich decided to study penalties in professional sports because these data were both readily available (from the central league offices) and highly relevant to their hypothesis about aggression and uniform color.

Because these penalty records were publicly available, the researchers were able to access them easily. But if the research question involved sensitive or personal information—such as hospital records or personal correspondence—researchers would need to obtain permission from a responsible party. Say we wanted to analyze the love letters written by soldiers serving overseas and then try to predict relationship stability. Given the personal, even intimate nature of these letters, we would need permission from each person involved before proceeding with the study. However researchers manage to obtain access to private records, protecting the privacy and anonymity of the people involved is paramount. This would mean, for example, using pseudonyms and/or removing names and other identifiers from published excerpts of personal letters.

Step 2—Sample From the Archives

In Weigel’s research on race relations, the most obvious choice of archives comprised snippets of both television programming and commercials. Yet this decision was only the first step of the process. Should the researchers examine every second of every program ever aired on television? Naturally not; instead, their approach was to take a smaller sample of television programming. Chapter 4 (4.3) will discuss sampling in more detail, but the basic process involves taking a smaller, representative collection of the broader population to conserve resources. Weigel and colleagues (1980) decided to sample one week’s worth of prime-time programming from 1978, assembling videotapes of everything broadcast by the three major networks at the time (CBS, NBC, and ABC). The research team narrowed its sample by eliminating news, sports, and documentary programming because the hypotheses centered on portrayals of fictional characters of different races.
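The two moves in this step, narrowing the archive to relevant material and then drawing a smaller sample from what remains, can be sketched as follows. The program titles, genre labels, and function names are hypothetical, and the random draw is just one possible sampling strategy; Weigel and colleagues sampled a full week of programming rather than drawing programs at random.

```python
import random

# Genres dropped because the hypotheses concerned fictional characters.
EXCLUDED_GENRES = {"news", "sports", "documentary"}


def eligible_programs(programs):
    """Keep only the programs whose genre is relevant to the hypothesis."""
    return [program for program in programs if program["genre"] not in EXCLUDED_GENRES]


def draw_sample(programs, k, seed=0):
    """Randomly choose up to k eligible programs; the seed makes the draw repeatable."""
    pool = eligible_programs(programs)
    return random.Random(seed).sample(pool, min(k, len(pool)))


# A made-up archive of recorded programs.
archive = [
    {"title": "Program A", "genre": "drama"},
    {"title": "Program B", "genre": "news"},
    {"title": "Program C", "genre": "comedy"},
    {"title": "Program D", "genre": "sports"},
    {"title": "Program E", "genre": "drama"},
]

print([program["title"] for program in eligible_programs(archive)])
print([program["title"] for program in draw_sample(archive, k=2)])
```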

Step 3—Code and Analyze the Archives

The third and most involved step of content analysis is to develop a system for coding and analyzing the archival data. Even a sample of one week’s worth of prime-time programming contains a near-infinite amount of information. In the race-relations studies, Weigel et al. elected to code four key variables: (1) the “total human appearance time,” or time during which people were onscreen; (2) the “Black appearance time,” in which Black characters appeared onscreen; (3) the “cross-racial appearance time,” in which characters of two races were onscreen at the same time; and (4) the “cross-racial interaction time,” in which cross-racial characters interacted. In the original (1980) paper, these authors reported that Black characters were shown only 9% of the time, and cross-racial interactions only 2% of the time. Fortunately, by the time of their 1995 follow-up study, the rate of Black appearances had doubled, and the rate of cross-racial interactions had more than tripled. However, depressingly little change occurred in some of the qualitative dimensions that they measured, including the degree of emotional connection between characters of different races.
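Once segments of programming have been coded, summary percentages like those above follow from simple arithmetic. The sketch below is illustrative only: the segment records, field names, and the choice to express each code as a share of total human appearance time are assumptions, not details taken from the original papers.

```python
def percent_of_human_time(segments, code):
    """Share of total human appearance time during which a given code applies."""
    human_seconds = sum(s["seconds"] for s in segments if s["humans"])
    if human_seconds == 0:
        return 0.0
    coded_seconds = sum(s["seconds"] for s in segments if s["humans"] and s[code])
    return 100 * coded_seconds / human_seconds


# Made-up coded segments; each record is one stretch of footage.
segments = [
    {"seconds": 60, "humans": True, "black": False, "cross_racial": False, "interaction": False},
    {"seconds": 30, "humans": True, "black": True, "cross_racial": True, "interaction": False},
    {"seconds": 10, "humans": True, "black": True, "cross_racial": True, "interaction": True},
    {"seconds": 20, "humans": False, "black": False, "cross_racial": False, "interaction": False},
]

for code in ("black", "cross_racial", "interaction"):
    print(code, percent_of_human_time(segments, code))
```

In this made-up sample, people are onscreen for 100 seconds, so 40 seconds of Black appearance time works out to 40% and 10 seconds of cross-racial interaction to 10%.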

This study also highlights the variety of options for coding complex behaviors. The four key ratings of “appearance time” consist of simply recording the amount of time that each person or group is onscreen. In addition, the researchers assessed several abstract qualities of interaction using judges’ ratings. The degree of emotional connection, for instance, was measured by having judges rate the “extent to which cross-racial interactions were characterized by conditions promoting mutual respect and understanding” (Weigel et al., 1980, p. 888). As Chapter 2 (2.2) explained, any time researchers use judges’ ratings, it is important to collect ratings from more than one rater and to make sure they agree in their assessments.
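One common way to check whether two raters agree is to compare their codes directly. The sketch below assumes the judges assigned categorical codes (for example, "high" or "low" mutual respect) to the same set of interactions; the ratings and function names are hypothetical. It computes simple percent agreement and Cohen's kappa, which corrects that agreement for the amount expected by chance.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Proportion of items on which the two raters gave the same code."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must code the same non-empty set of items.")
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Agreement between two raters, corrected for agreement expected by chance."""
    observed = percent_agreement(rater_a, rater_b)
    n = len(rater_a)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters pick the same code independently.
    expected = sum(counts_a[code] * counts_b[code] for code in counts_a) / (n * n)
    if expected == 1:
        return 1.0
    return (observed - expected) / (1 - expected)


# Made-up codes from two judges rating the same four interactions.
judge_1 = ["high", "high", "low", "low"]
judge_2 = ["high", "low", "low", "low"]

print(percent_agreement(judge_1, judge_2))  # 0.75
print(cohens_kappa(judge_1, judge_2))       # 0.5
```

The judges match on three of four items (75%), but half of that agreement would be expected by chance, so kappa is a more modest 0.5.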

A researcher’s goal is to find a systematic way to record the observations most relevant to the hypothesis. This is particularly true for quantitative research, where the key is to start with clear operational definitions that capture the variables of interest. This involves deciding both on the most appropriate variables and on the best way to measure them. For example, someone who analyzes written communication might decide to compare words, sentences, characters, or themes across the sample. A study of newspaper coverage might code the amount of space or number of stories dedicated to a topic, while a study of television news might code the amount of airtime given to different positions. The best strategy in each case will be the one that best represents the variables of interest.

Qualitative versus Quantitative Approaches

Archival research can represent either qualitative or quantitative research, depending on the researcher’s approach to the archives. Most of the examples in this section represent the quantitative approach: Frank and Gilovich (1988) counted penalties to test their hypothesis about aggression, and Stirman and Pennebaker (2001) counted words to test their hypothesis about suicide. However, the race-relations work by Weigel and colleagues (1980; 1995) represents a nice mix of qualitative and quantitative research. In the initial 1980 study, their primary goal was to document the portrayal of race relations on prime-time television, learning from the ground up (i.e., qualitative). In the 1995 follow-up study, though, the researchers primarily wanted to determine whether these portrayals had changed over a 15-year period. That is, they tested the hypothesis that race relations were portrayed in a more positive light (i.e., quantitative). Another way in which archival research can be qualitative is to study open-ended narratives, without attempting to impose structure upon them. This approach is commonly used to study free-flowing text, such as personal correspondence or letters to the editor in a newspaper. A researcher approaching these from a qualitative perspective would attempt to learn from these narratives, without attempting to impose structure via the use of content analyses.

Observational research can be used to measure an infant’s attachment to a caregiver.

3.4 Observational Research

Moving further along the continuum of control, we come to the descriptive design with the greatest amount of researcher control. Observational research involves studies that directly observe behavior and record these observations in an objective and systematic way. Your previous psychology courses may have explored the concept of attachment theory, which argues that an infant’s bond with his or her primary caregiver has implications for later social and emotional development. Mary Ainsworth, a Canadian developmental psychologist, and John Bowlby, a British psychologist and psychiatrist, articulated this theory in the 1960s. They argued that children can form either “secure” or a variety of “insecure” attachments with their caregivers (Ainsworth & Bell, 1970; Bowlby, 1963).

To assess these classifications, Ainsworth and Bell developed an observational technique called the “strange situation.” Mothers would arrive at their laboratory with their children for a series of structured interactions, including having the mother play with the infant, leave him alone with a stranger, and then return to the room after a brief absence. The researchers were most interested in coding the ways in which the infant responded to these various episodes (eight in total). One group of infants, for example, was curious when the mother left but then returned to playing with toys, trusting that she would return. Another group showed immediate distress when the mother left and clung to her nervously upon her return. Based on these and other behavioral observations, Ainsworth and colleagues classified these groups of infants as “securely” and “insecurely” attached to their mothers, respectively.

Research: Making an Impact

Harry Harlow

In the 1950s, U.S. psychologist Harry Harlow conducted a landmark series of studies on the mother–infant bond using rhesus monkeys. Although contemporary standards would consider his research unethical, the results of his work revealed the importance of affection, attachment, and love on healthy childhood development.

Prior to Harlow’s findings, it was believed that infants attached to their mothers as a part of a drive to fulfill exclusively biological needs, in this case obtaining food and water and avoiding pain (Herman, 2007; van der Horst & van der Veer, 2008). In an effort to clarify the reasons that infants so clearly need maternal care, Harlow removed rhesus monkeys from their natural mothers several hours after birth, giving the young monkeys a choice between two surrogate “mothers.” Both mothers were made of wire, but one was bare and one was covered in terry cloth. Although the wire mother provided food via an attached bottle, the monkeys preferred the softer, terry-cloth mother, even though the latter provided no food (Harlow & Zimmerman, 1958; Herman, 2007).

Further research with the terry-cloth mothers contributed to the understanding of healthy attachment and childhood development (van der Horst & van der Veer, 2008). When the young monkeys were allowed to explore a room with their terry-cloth mothers present, they used the mothers as a safe base. Similarly, when exposed to novel stimuli such as a loud noise, the monkeys would seek comfort from the cloth-covered surrogate (Harlow & Zimmerman, 1958). However, when the monkeys were left in the room without their cloth mothers, they reacted poorly—freezing up, crouching, crying, and screaming.

A control group of monkeys who were never exposed to either their real mothers or one of the surrogates revealed stunted forms of attachment and affection. They were left incapable of forming lasting emotional attachments with other monkeys (Herman, 2007). Based on this research, Harlow discovered the importance of proper emotional attachment, stressing the importance of physical and emotional bonding between infants and mothers (Harlow & Zimmerman, 1958; Herman, 2007).

Harlow’s influential research led to improved understanding of maternal bonding and child development (Herman, 2007). His research paved the way for improvements in infant and child care and in helping children cope with separation from their mothers (Bretherton, 1992; Du Plessis, 2009). In addition, Harlow’s work contributed to the improved treatment of children in orphanages, hospitals, day care centers, and schools (Herman, 2007; van der Horst & van der Veer, 2008).

Pros and Cons of Observational Research

Observational designs are well suited to a wide range of research questions, provided the questions can be addressed through directly observable behaviors and events. For example, researchers can observe parent–child interactions, or nonverbal cues to emotion, or even crowd behavior. However, if they are interested in studying thought processes—such as how close mothers feel to their children—then observation will not suffice. This point harkens back to the discussion of behavioral measures in Chapter 2 (2.2): In exchange for giving up access to internal processes, researchers gain access to unfiltered behavioral responses.

To capture these unfiltered behaviors, it is vital for the researcher to be as unobtrusive as possible. As we have already discussed, people have a tendency to change their behavior when they are being observed. In the bullying study by Craig and Pepler (1997) discussed at the beginning of this chapter, the researchers used video cameras to record children’s behavior unobtrusively. Imagine how (artificially) low the occurrence of bullying might be if the playground had been surrounded by researchers with clipboards!

If researchers conduct an observational study in a laboratory setting, they have no way to hide the fact that people are being observed, but the use of one-way mirrors and video recordings can help people to become comfortable with the setting. Researchers who conduct an observational study out in the real world have even more possibilities for blending into the background, including using observers who are literally hidden. For example, someone hypothesizes that people are more likely to pick up garbage when the weather is nicer. Rather than station an observer with a clipboard by the trash can, the researcher could place someone out of sight behind a tree, or perhaps sitting on a park bench pretending to read a magazine. In both cases, people would be less conscious of being observed and therefore more likely to behave naturally.

One extremely clever strategy for blending in comes from a study by the social psychologist Muzafer Sherif et al. (1954), involving observations of cooperative and competitive behaviors among boys at a summer camp. For Sherif, it was particularly important to make observations in this context without the boys realizing they were part of a research study. Sherif took on the role of camp janitor, which allowed him to be a presence in nearly all of the camp activities. The boys never paid enough attention to the “janitor” to realize his omnipresence—or his discreet note-taking. The brilliance of this idea is that it takes advantage of the fact that people tend to blend into the background once we become used to their presence.

Types of Observational Research

Several variations of observational research exist, according to the amount of control that a researcher has over the data collection process. Structured observation involves creating a standard situation in a controlled setting and then observing participants’ responses to a predetermined set of events. The “strange situation” studies of parent–child attachment (discussed above) are a good example of structured observation—mothers and infants are subjected to a series of eight structured episodes, and researchers systematically observe and record the infants’ reactions. Even though these types of studies are conducted in a laboratory, they differ from experimental studies in an important way: Rather than systematically manipulate a variable to make comparisons, researchers present the same set of conditions to all participants.

Another example of structured observation comes from the research of John Gottman, a psychologist at the University of Washington. For nearly three decades, Gottman and his colleagues have conducted research on the interaction styles of married couples. Couples who take part in this research are invited for a three-hour session in a laboratory that closely resembles a living room. Gottman’s goal is to make couples feel reasonably comfortable and natural in the setting to get them talking as they might do at home. After allowing them to settle in, Gottman adds the structured element by asking the couple to discuss an “ongoing issue or problem” in their marriage. The researchers then sit back to watch the sparks fly, recording everything from verbal and nonverbal communication to measures of heart rate and blood pressure. Gottman has observed and tracked so many couples over the decades that he is able to predict, with remarkable accuracy, which couples will divorce in the 18 months following the lab visit (Gottman & Levenson, 1992).

Naturalistic observation, meanwhile, involves observing and systematically recording behavior in the real world. This can be conducted in two broad ways—with or without intervention on the part of the researcher. Intervention in this context means that the researcher manipulates some aspect of the environment and then observes people’s responses. For example, a researcher might leave a shopping cart just a few feet away from the cart-return area and track whether people move the cart. (Given the number of carts that are abandoned just inches away from their proper destination, someone must be doing this research all the time.) Recall an example from Chapter 1 (the discussion of ethical dilemmas in section 1.5) in which Harari et al. (1995) used naturalistic observation to study whether people would help in emergency situations. In brief, these researchers staged what appeared to be an attempted rape in a public park and then observed whether groups or individual males were more likely to rush to the victim’s aid.

The ABC network has developed a hit reality show that mimics this type of research. The show, What Would You Do?, sets up provocative situations in public settings and videotapes people’s reactions. An unwitting participant in one of these episodes might witness a customer stealing tips from a restaurant table, or a son berating his father for being gay, or a man proposing to his girlfriend who minutes earlier had been kissing another man at the bar. Of course, these observation “studies” are more interested in shock value than data collection (or Institutional Review Board [IRB] approval; see Section 1.5), but the overall approach can be a useful strategy to assess people’s reactions to various situations. In fact, some of the scenarios on the show are based on classic studies in social psychology, such as the well-documented phenomenon that people are reluctant to take responsibility for helping in emergencies.

Alternatively, naturalistic studies can involve simply recording ongoing behavior without any attempt by the researchers to intervene or influence the situation. In these cases, the goal is to observe and record behavior in a completely natural setting. For example, researchers might station themselves at a liquor store and observe the numbers of men and women who buy beer versus wine. Or, they might observe the numbers of people who give money to the Salvation Army bell-ringers during the holiday season. A researcher can use this approach to compare different conditions, provided the differences occur naturally. That is, researchers could observe whether people donate more money to the Salvation Army on sunny or snowy days, or compare donation rates when the bell-ringers are different genders or races. Do people give more money when the bell-ringer is an attractive female? Or do they give more to someone who looks needier? These are all research questions that could be addressed using a well-designed naturalistic observation study.

Finally, participant observation involves having the researcher(s) conduct observations while engaging in the same activities as the participants. The goal is to interact with these participants to gain better access and insight into their behaviors. In one famous example, the psychologist David Rosenhan (1973) was interested in the experience of people hospitalized for mental illness. To study these experiences, he had eight perfectly sane people gain admission to different mental hospitals. These fake patients were instructed to give accurate life histories to a doctor but lie about one diagnostic symptom. They all claimed to hear an occasional voice saying the words “empty,” “hollow,” and “thud.” Such auditory hallucinations are a symptom of schizophrenia, and Rosenhan chose these words to vaguely suggest an existential crisis.

RENARD/BSIP/Superstock

Psychologist David Rosenhan’s study of staff and patients in a mental hospital found that patients tended to be treated based on their diagnosis, not on their actual behavior.

Once admitted, these “patients” behaved in a normal and cooperative manner, with instructions to convince hospital staff that they were healthy enough to be released. In the meantime, they observed life in the hospital and took notes on their experiences, a behavior that many doctors interpreted as “paranoid note-taking.” The main finding of this study was that hospital staff tended to view all patient behaviors through the lens of their initial diagnoses. Despite immediately acting “normally,” these fake patients were hospitalized an average of 19 days (with a range from 7 to 52) before being released. All but one were diagnosed with “schizophrenia in remission” upon release. Rosenhan’s other striking finding was that treatment was generally depersonalized, with staff spending little time with individual patients.

In another example of participant observation, Festinger, Riecken, and Schachter (1956) decided to join a doomsday cult to test their new theory of cognitive dissonance. Briefly, this theory argues that people are motivated to maintain a sense of consistency among their various thoughts and behaviors. So, for example, a person who smokes a cigarette despite being aware of the health risks might rationalize smoking by convincing herself that lung-cancer risk is really just genetic. In this case, Festinger and colleagues stumbled upon the case of a woman named Mrs. Keach, who was predicting the end of the world, via alien invasion, at 11 p.m. on a specific date six months in the future. What would happen, they wondered, when this prophecy failed to come true? (One can only imagine how shocked they would have been had the prophecy turned out to be correct.)

To answer this question, the researchers pretended to be new converts and joined the cult, living among the members and observing them as they made their preparations for doomsday.

Sure enough, the day came, and 11 p.m. came and went without the world ending. Mrs. Keach first declared that she had forgotten to account for a time-zone difference, but as sunrise started to approach, the group members became restless. Finally, after a short absence to communicate with the aliens, Mrs. Keach returned with some good news: The aliens were so impressed with the devotion of the group that they decided to postpone their invasion. The group members rejoiced, rallying around this brilliant piece of rationalizing, and quickly began a new campaign to recruit new members.

As these examples illustrate, participant observation can provide access to amazing and one-of-a-kind data, including insights into group members’ thoughts and feelings. This approach also provides access to groups that might be reluctant to allow outside observers. However, the participant approach has two clear disadvantages compared with other types of observation. The first problem is ethical; data are collected from individuals who do not have the opportunity to give informed consent. Indeed, the whole point of the technique is to observe people without their knowledge. Before an IRB can approve this kind of study, researchers must show an extremely compelling reason to ignore informed consent, as well as extremely rigorous measures to protect identities. The second problem is methodological; the approach provides ample opportunity for the objectivity of observations to be compromised by the close contact between researcher and participant. Because the researchers are a part of the group, they can change the dynamics in subtle ways, possibly leading the group to confirm their hypothesis. In addition, the group can shape the researchers’ interpretations in subtle ways, leading them to miss important details.

Another spin on participant observation is called ethnography, or the scientific study of the customs of people and cultures. This is very much a qualitative method that focuses on observing people in the real world and learning about a culture from the perspective of the person being studied—that is, learning from the ground up rather than testing hypotheses. Ethnography is used primarily in other social-science fields, such as anthropology. In one famous example, the cultural anthropologist Margaret Mead (1928) used this approach to shed light on differences in social norms around adolescence between American and Samoan societies. Mead’s conclusions were based on interviews she conducted over a six-month period, observing and living alongside a group of 68 young women. Mead concluded from these interviews that Samoan children and adolescents are largely ignored until they reach the age of 16 and become full members of society. Among her more provocative claims was the idea that Samoan adolescents were much more liberal in their sexual attitudes and behaviors than American adolescents.

Mead’s work has been the subject of criticism by a handful of other anthropologists, one of whom has even suggested that Mead was taken in by an elaborate joke played by the group of young girls. Still others have come to Mead’s rescue and challenged the critics’ interpretations. The nature of this debate between Mead’s critics and her supporters highlights a distinctive characteristic of qualitative methods: “Winning” the argument is based on challenging interpretations of the original interviews and observations. In contrast, disagreements around quantitative methods are generally based on examining statistical results from hypothesis testing. While quantitative methods may lose much of the richness of people’s experiences, they do offer an arguably more objective way of settling theoretical disputes.

Steps in Observational Research

One of the major strengths of observational research is its high degree of ecological validity; that is, the research can be conducted in situations that closely resemble the real world. Think of the chapter examples so far—married couples observed in a living-room-like laboratory; doomsday cults observed from within; bullying behaviors on the school playground. In every case, people’s behaviors are observed in the natural environment or something very close to it. However, this ecological validity comes at a price; the real world is a jumble of information, some relevant, some not so much. The challenge for researchers, then, is to decide on a system that provides the best test of their hypothesis, one that can sort out the signal from the noise. This section discusses a three-step process for conducting observational research. The key point to note right away is that most of this process involves making decisions ahead of time so that the process of data collection is smooth, simple, and systematic.

Step 1: Develop a Hypothesis

For research to be systematic, it is important to impose structure by having a clear research question, and, in the case of quantitative research, a clear hypothesis as well. Other chapters have covered hypotheses in detail, but the main points bear repeating: A hypothesis must be testable and falsifiable, meaning that it must be framed in such a way that it can be addressed through empirical data and might be disconfirmed by these data. In the example involving Salvation Army donations, we predicted that people might donate more money to an attractive bell-ringer. This hypothesis could easily be tested empirically and could just as easily be disconfirmed by the right set of data (say, if attractive bell-ringers brought in the fewest donations).

This particular example also highlights an additional important feature of observational hypotheses; namely, they must be based on observable behaviors. That is, we can safely make predictions about the amount of money people will donate because we can directly observe it. We are, nonetheless, unable to make predictions in this context about the reasons for donations. We would have no way to observe, say, that people donate more to attractive bell-ringers because they are trying to impress them. In sum, one limitation of observing behavior in the real world is that it prevents researchers from delving into the cognitive and motivational reasons behind the behaviors.

Steve Mason/Photodisc/Thinkstock

The dinner scene at a busy restaurant offers a wide variety of behaviors to observe. In order to simplify the observation process, researchers should narrow the focus by taking a sample.

Step 2: Decide What and How to Sample

Once a researcher has developed a hypothesis that is testable, falsifiable, and observable, the next step is to decide what kind of information to gather from the environment to test this hypothesis. The simple fact is that the world is too complex to sample everything. Imagine that someone wanted to observe the dinner rush at a restaurant. A nearly infinite list of possibilities for observation presents itself: What time does the restaurant get crowded? How often do people send their food back to the kitchen? What are the most popular dishes? How often do people get in arguments with the wait staff? To simplify the process of observing behavior, the researcher will need to take a sample, or a smaller portion of the population, that is relevant to the hypothesis. That is, rather than observing “dinner at the restaurant,” the researcher’s goal is to narrow his or her focus to something as specific as “the number of people waiting in line for a table at 6 p.m. versus 9 p.m.”

The choice of what and how to sample will ultimately depend on the best fit for the hypothesis. The context of observational research offers three strategies for sampling behaviors and events. The first strategy, time sampling, involves comparing behaviors during different time intervals. For example, to test the hypothesis that football teams make more mistakes when they start to get tired, researchers could count the number of penalties in the first five minutes and the last five minutes of the game. These data would allow researchers to compare mistakes at one time interval with mistakes at another time interval. In the study of a doomsday cult by Festinger and colleagues (1956), time sampling was used to compare how the group members behaved before and after their prophecy failed to come true.
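As a rough illustration of time sampling, the short Python sketch below counts events that fall inside two time windows, mirroring the football example. The penalty times, the 60-minute game length, and the helper name `count_in_window` are all invented for the illustration; they are not data from any study.

```python
# Hypothetical penalty times, in minutes elapsed in a 60-minute game.
# These values are made up purely to illustrate the counting logic.
penalty_minutes = [2.5, 4.0, 17.3, 31.8, 44.1, 55.2, 56.7, 58.0, 59.4]

GAME_LENGTH = 60  # assumed game length in minutes
WINDOW = 5        # length of each sampled interval in minutes


def count_in_window(times, start, end):
    """Count events whose time satisfies start <= time < end."""
    return sum(start <= t < end for t in times)


# Sample only the first and last five minutes, ignoring everything between.
first_window = count_in_window(penalty_minutes, 0, WINDOW)
last_window = count_in_window(penalty_minutes, GAME_LENGTH - WINDOW, GAME_LENGTH)

print(f"Penalties in first {WINDOW} minutes: {first_window}")
print(f"Penalties in last {WINDOW} minutes: {last_window}")
```

The key design choice is that the two windows are fixed before any data are examined, so the comparison cannot be tuned after the fact to favor the hypothesis.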

The second strategy, individual sampling, involves collecting data by observing one person at a time to test hypotheses about individual behaviors. Many of the examples already discussed involve individual sampling: Ainsworth and colleagues (1970) tested their hypotheses about attachment behaviors by observing individual infants, while Gottman and Levenson (1992) tested their hypotheses about romantic relationships by observing one married couple at a time. These types of data allow researchers to examine behavior at the individual level and test hypotheses about the kinds of things people do, from the way they argue with their spouses to whether they wear team colors to a football game.

The third strategy, event sampling, involves observing and recording behaviors that occur throughout an event. For example, we could track the number of fights that break out during an event such as a football game, or the number of times people leave the restaurant without paying the check. This strategy allows for testing hypotheses about the types of behaviors that occur in a particular environment or setting. For instance, a researcher might compare the number of fights that break out in a professional football game versus a professional hockey game. Or, the next time we host a party, we could count the number of wine bottles versus beer bottles that end up in the recycling bin. The distinguishing feature of this strategy is its focus on the occurrence of behaviors more than on the individuals performing these behaviors.

Step 3: Record and Code Behavior

Having formulated a hypothesis and decided on the best sampling strategy, researchers must perform one final and critical step before beginning data collection. Namely, they have to develop good operational definitions of the variables by translating the underlying concepts into measurable variables. Gottman’s research turns the concept of marital interactions into a range of measurable variables, such as the number of dismissive comments and passive-aggressive sighs, all things that can be observed and counted objectively. Rosenhan’s 1973 study involving fake schizophrenic patients turned the concept of patient experience into measurable variables such as the amount of time staff members spent with each patient, which is again very straightforward to observe.

It is vital that researchers decide up front what kinds and categories of behavior they will be observing and recording. In the last section, we narrowed down our observation of dinner at the restaurant to the number of people in line at 6 p.m. versus the number of people in line at 9 p.m. But how can we be sure of an accurate count? What if two people are waiting by the door while the other two members of the group are sitting at the bar? Are those at the bar waiting for a table or simply having drinks? One possibility might be to count the number of individuals who walk through the door in different time periods, although our count could be inflated by those who give up on waiting or who only enter to sneak in and out of the restroom.

In short, observing behavior in the real world can be messy. The best way to deal with this mess is to develop a clear and consistent categorization scheme and stick with it. That is, in testing a hypothesis about the most crowded time at a restaurant, researchers would choose one method of counting people and use it for the duration of the study. In part, this choice of a method is a judgment call, but researchers’ judgment should be informed by three criteria. First, they should consider practical issues, such as whether their categories can be directly observed. A researcher can observe the number of people who leave the restaurant but cannot observe whether they got impatient. Second, they should consider theoretical issues, such as how well the categories represent the underlying theory. Why did researchers decide to study the most crowded time at the restaurant? Perhaps this particular restaurant is in a new, up-and-coming neighborhood, and they expect the restaurant to become crowded over the course of the evening. This theory would also lead researchers to include people sitting both at tables and at the bar, because this crowd may come to the restaurant with the sole intention of staying at the bar. Finally, researchers should consider previous research in choosing their categories. Have other researchers studied dining patterns in restaurants? What kinds of behaviors did they observe? If these categories make sense for the project, researchers may feel free to re-use them; there is no need to reinvent the wheel.

Last but not least, a researcher should take a step back and evaluate both the validity and the reliability of the coding system. (See Section 2.2 for a review of these terms.) Validity in this case means making sure the categories capture the underlying variables in the hypothesis (i.e., construct validity; see Section 2.2). For example, in Gottman’s studies of marital interactions, some of the most important variables are the emotions expressed by both partners. One way to observe emotions would be to count the number of times a person smiles. However, we would have to think carefully about the validity of this measure, because smiling could indicate either genuine happiness or condescension. As a general rule, the better and more specific researchers’ operational definitions, the more valid their measures will be (Chapter 2).

Reliability in this context means making sure data are collected in a consistent way. If research involves more than one observer using the same system, their data should look roughly the same (i.e., interrater reliability). This reliability is accomplished in part by making the observation task simple and straightforward—for example, having trained assistants use a checklist to record behaviors rather than depending on open-ended notes. The other key to improving reliability is careful training of the observers, giving them detailed instructions and ample opportunities to practice the rating system.
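As a rough illustration of how interrater reliability might be checked, the Python sketch below compares two observers’ checklist codes using simple percent agreement and Cohen’s kappa. Kappa is a common chance-corrected agreement statistic that this chapter does not itself introduce, so treat its use here as one reasonable option rather than a prescribed method. The observer codes and the category labels ("smile", "sigh", "none") are made up for the example.

```python
from collections import Counter


def percent_agreement(codes_a, codes_b):
    """Share of observations on which both observers recorded the same code."""
    if len(codes_a) != len(codes_b) or not codes_a:
        raise ValueError("observers must code the same non-empty set of observations")
    matches = sum(a == b for a, b in zip(codes_a, codes_b))
    return matches / len(codes_a)


def cohens_kappa(codes_a, codes_b):
    """Agreement between two observers, corrected for chance agreement."""
    n = len(codes_a)
    observed = percent_agreement(codes_a, codes_b)
    counts_a = Counter(codes_a)
    counts_b = Counter(codes_b)
    # Chance agreement: probability both observers pick the same category
    # if each coded independently at their own base rates.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1:
        return 1.0  # both observers used a single identical category throughout
    return (observed - expected) / (1 - expected)


# Hypothetical codes from two trained observers watching the same 10 intervals.
observer_1 = ["smile", "smile", "sigh", "none", "sigh", "smile", "none", "none", "sigh", "smile"]
observer_2 = ["smile", "sigh", "sigh", "none", "sigh", "smile", "none", "smile", "sigh", "smile"]

agreement = percent_agreement(observer_1, observer_2)
kappa = cohens_kappa(observer_1, observer_2)
print(f"Percent agreement: {agreement:.0%}")
print(f"Cohen's kappa: {kappa:.2f}")
```

Percent agreement is easy to interpret but can look high simply because one category dominates; kappa adjusts for that, which is why the two numbers are reported side by side here.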

Observation Examples

To explain how all of this comes together, we will explore a pair of examples, from research question to data collection.

Example 1: Theater Restroom Usage

First, imagine, for the sake of this example, that someone is interested in whether people are more likely to use the restroom before or after watching a movie. Such a research question could provide valuable information for theater owners in planning employee schedules (i.e., when are bathrooms most likely to need cleaning). Thus, studying patterns of human behavior results in valuable applied knowledge.

The first step is to develop a specific, testable, and observable hypothesis. In this case, we might predict that people are more likely to use the restroom after the movie, as a result of consuming those 64-ounce sodas during the movie. Just for fun, we will also compare the restroom usage of men and women. Perhaps men are more likely to wait until after the movie, whereas women are just as likely to go before as after? This pattern of data might look something like the percentages in Table 3.1. That is, men make 80% of their restroom visits after the movie and 20% before the movie, while women make about 50% of their restroom visits at each time.

Table 3.1: Hypothesized restroom visits

Gender          Men     Women
Before movie    20%     50%
After movie     80%     50%
Total           100%    100%

The next step is to decide on the best sampling strategy to test this hypothesis. Of the three sampling strategies discussed— individual, event, and time—which one seems most relevant here? The best option would probably be time sampling because the hypothesis involves comparing the number of restroom visitors in two time periods (before versus after the movie). So, in this case, we would need to define a time interval for collecting data. We could limit our observations to the 10 minutes before the previews begin and the 10 minutes after the credits end. The potential problem here, of course, is that some people might use either the previews or the end credits as a chance to use the restroom. Another complication arises in trying to determine which movie people are watching; in a giant multiplex theater, movies start just as others are finishing. One possible solution, then, would be to narrow the sample to movie theaters that show only one movie at a time and to define the sampling times based on the actual movie start- and end-times.

Having determined a sampling strategy, the next step is to identify the types of behaviors we want to record. This particular hypothesis poses a challenge because it deals with a rather private behavior. To faithfully record people “using the restroom,” we would need to station researchers in both men’s and women’s restrooms to verify that people actually, well, “use” the restroom while they are in it. However, this strategy poses the potential downside that the researcher’s presence (standing in the corner of the restroom) will affect people’s behavior. Another, less intrusive option would be to stand outside the restroom and simply count “the number of people who enter.” The downside to that, of course, is that we technically do not know why people are going into the restroom. But sometimes research involves making these sorts of compromises—in this case, we chose to sacrifice a bit of precision in favor of a less-intrusive measurement. This compromise would also serve to reduce ethical issues with observing people in the restroom.

So, in sum, we started with the hypothesis that men are more likely to use the restroom after a movie, while women use the restroom equally before and after. We then decided that the best sampling strategy would be to identify a movie theater showing only one movie and to sample from the 10-minute periods before and after the actual movie’s running time. Finally, we decided that the best strategy for recording behavior would be to station observers outside the restrooms and count the number of people who enter. Now, say we conduct these observations every evening for one week and collect the data in Table 3.2.

Table 3.2: Findings from observing restroom visits

Gender          Men           Women
Before movie    75 (25%)      300 (60%)
After movie     225 (75%)     200 (40%)
Total           300 (100%)    500 (100%)

Notice that more women (N = 500) than men (N = 300) entered the restroom during our week of sampling. The real test of our hypothesis, however, comes from examining the percentages within gender groups. That is, of the 300 men who went into the restroom, what percentage of them did so before the movie and what percentage of them did so after the movie? In this dataset, women used the restroom with roughly similar frequency before (60%) and after (40%) the movie. Men, in contrast, were three times as likely to use the restroom after (75%) as before (25%) the movie. In other words, our hypothesis appears to be confirmed by examining these percentages.
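The within-group percentages can be reproduced with a few lines of Python. The sketch below uses the counts from Table 3.2; the variable and function names are our own, chosen only for this illustration.

```python
# Restroom visit counts copied from Table 3.2.
visits = {
    "Men": {"before": 75, "after": 225},
    "Women": {"before": 300, "after": 200},
}


def within_group_percentages(counts):
    """Convert one group's counts into percentages of that group's own total."""
    total = sum(counts.values())
    return {period: 100 * n / total for period, n in counts.items()}


percentages = {gender: within_group_percentages(c) for gender, c in visits.items()}

for gender, pct in percentages.items():
    # Ratio of after-movie visits to before-movie visits within the group.
    ratio = visits[gender]["after"] / visits[gender]["before"]
    print(
        f"{gender}: before {pct['before']:.0f}%, after {pct['after']:.0f}%, "
        f"after/before ratio {ratio:.2f}"
    )
```

Dividing by each group’s own total, rather than by the grand total of 800 visits, is what keeps the larger number of women in the sample from distorting the comparison.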

Example 2: Cell Phone Usage While Driving

Imagine that we are interested in patterns of cell phone usage among drivers. Several recent studies have reported that drivers using cell phones are as impaired as drunk drivers, making this an important public safety issue. Thus, if we could understand the contexts in which people are most likely to use cell phones, it would provide valuable information for developing guidelines for safe and legal use of these devices. So, this study might count the number of drivers using cell phones in two settings: while navigating rush-hour traffic and while moving on the freeway.

The first step is to develop a specific, testable, and observable hypothesis. In this case, we might predict that people are more likely to use cell phones when they are bored in the car. So, we hypothesize that we will see more drivers using cell phones while stuck in rush-hour traffic than while moving on the freeway.

The next step is to decide on the best sampling strategy to test this hypothesis. Of the three sampling strategies discussed— individual, event, and time—which one seems most relevant here? The best option would probably be individual sampling because we are interested in the cell phone usage of individual drivers. That is, for each individual car we see during the observation period, we want to know whether the driver is using a cell phone. One strategy for collecting these observations would be to station observers along a fast-moving stretch of freeway, as well as along a stretch of road that is clogged during rush hour. These observers would keep a record of each passing car and note whether the driver is on the phone.

After selecting a sampling strategy, we next must decide the types of behaviors to record. One challenge this study presents is how broadly to define cell phone usage. Should we include both talking and text messaging? Given our interest in distraction and public safety, we probably want to include text messaging. Several states have recently banned this practice while driving, often in response to tragic accidents. Because we will be observing moving vehicles, the most reliable approach might be to simply note whether drivers have a cell phone in their hand. As with the restroom study, we sacrifice a little bit of precision (i.e., knowing what the driver is using the cell phone for) to capture behaviors that are easier to record.

To sum up, we started with the hypothesis that drivers would be more likely to use cell phones when stuck in traffic. We then decided that the best sampling strategy would be to station observers along two stretches of road who would note whether drivers were using cell phones. Finally, we decided that the cell phone usage would be defined as each driver holding a cell phone. Now, suppose we conducted these observations over a 24-hour period and collected the data in Table 3.3.

Table 3.3: Findings from observing cell phone usage

                 Rush Hour     Highway
Cell Phone       30 (30%)      200 (67%)
No Cell Phone    70 (70%)      100 (33%)
Total            100 (100%)    300 (100%)

The results show that more cars passed by on the highway (N = 300) than on the street during the rush-hour stretch (N = 100). The real test of our hypothesis, though, comes from examining the percentages within each stretch. That is, of the 100 people observed during rush hour and the 300 observed on the highway, what percentage was using cell phones? In this data set, 30% of those in rush hour were using cell phones, compared with 67% of those on the highway. In other words, the data did not confirm our hypothesis. Drivers in rush hour were less than half as likely to be using cell phones. The next step in this research program would be to speculate on the reasons the data contradicted the hypothesis.
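The within-setting percentages discussed above can be reproduced with a short sketch. The counts come from Table 3.3; the dictionary layout and the function name `column_percentages` are illustrative assumptions, not part of the original study.

```python
# Counts copied from Table 3.3; the nested-dictionary layout is an illustrative choice.
counts = {
    "Rush Hour": {"Cell Phone": 30, "No Cell Phone": 70},
    "Highway": {"Cell Phone": 200, "No Cell Phone": 100},
}


def column_percentages(table):
    """Convert raw counts to percentages within each setting (column)."""
    result = {}
    for setting, cells in table.items():
        total = sum(cells.values())
        result[setting] = {label: 100 * n / total for label, n in cells.items()}
    return result


percentages = column_percentages(counts)
for setting, cells in percentages.items():
    # Round only for display, as the table does.
    print(setting, {label: round(p) for label, p in cells.items()})
```

Dividing each cell by its own column total, rather than by the grand total of 400 cars, is what makes the two settings comparable even though three times as many cars were observed on the highway.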

Qualitative versus Quantitative Approaches

The general method of observation lends itself equally well to qualitative and quantitative approaches, although some types of observation fit one approach better than the other. For example, structured observation tends to focus on hypothesis testing and quantification of responses. In Mary Ainsworth’s (1970) “strange situation” research (described previously), the primary goal was to expose children to a predetermined script of events and to test hypotheses about how children with secure and insecure attachments would respond to these events. In contrast, naturalistic observation—and, to a greater extent, participant observation—tends to focus on learning from events as they occur naturally. In Leon Festinger’s “doomsday cult” study, the researchers joined the group to observe the ways members reacted when their prophecy failed to come true. Margaret Mead (1928) spent several months living with Samoan adolescents to understand social norms around coming of age.

Research: Thinking Critically

“Irritable Heart” Syndrome in Civil War Veterans

Follow the link below to an article by science writer and editor K. Kris Hirst. In this article, Hirst reviews compelling research from health psychologist Roxanne Cohen Silver and her colleagues at the University of California, Irvine. Cohen Silver and her colleagues reviewed the service records of 15,027 Civil War veterans, finding an astounding rate of mental illness—long before post-traumatic stress disorder was recognized. As you read the article, consider what you have learned so far about the research process, and then respond to the questions below.

http://psychology.about.com/od/ptsd/a/irritableheart.htm

Think about it:

1. What hypotheses are the researchers testing in this study?

2. How did the researchers quantify trauma experienced by Civil War soldiers? Do you think this is a valid way to operationalize trauma? Explain why or why not.

3. Would this research be best described as case studies, archival research, or natural observation? Does the study involve elements of more than one type? Explain.

3.5 Describing Your Data

Before we move on from descriptive research designs, this last section discusses the process of presenting descriptive data in both graphical and numeric form. No matter how the researcher presents data, a good description is accurate, concise, and easy to understand. In other words, researchers have to represent the data accurately and in the most efficient way possible so that their audience can understand it. Another, more eloquent way to think of these principles is to take the advice of Edward Tufte, a statistician and expert in the display of visual information. Tufte (2001) suggests that when people view visual displays, they should spend time on content-reasoning rather than design-decoding. The sole purpose of designing visual presentations is to communicate information. So, the audience should spend time thinking about the information being presented, not trying to puzzle through the display itself. The following sections explain guidelines for accomplishing this goal in both numeric and visual form.

Table 3.4 presents hypothetical data from a sample of 20 participants. In this example, we have asked people to report their gender and ethnicity, as well as answer questions about their overall life satisfaction and daily stress. Each row in this table represents one participant in the study, and each column represents one of the variables for which data were collected. This chapter focuses on ways to describe the sample characteristics. Later chapters will return to these principles in discussing graphs that display the relationship between two or more variables.

Table 3.4: Raw data from a sample of 20 individuals

Subject ID   Gender   Ethnicity          Life satisfaction   Daily stress
1            Male     White              40                  10
2            Male     White              47                  9
3            Female   Asian              29                  8
4            Male     White              32                  9
5            Female   Hispanic           25                  3
6            Female   Hispanic           35                  3
7            Female   White              28                  8
8            Male     Hispanic           40                  9
9            Male     Asian              37                  10
10           Female   African-American   30                  10
11           Male     White              43                  8
12           Male     Asian              40                  4
13           Male     White              48                  7
14           Female   African-American   30                  4
15           Female   White              37                  7
16           Male     Hispanic           40                  1
17           Female   White              36                  1
18           Male     African-American   45                  8
19           Female   White              42                  8
20           Female   African-American   38                  7

Numeric Descriptions

Because psychology is a scientific discipline, it often expresses preference for presenting data in number form. These numbers provide a metric that can be used to compare findings from one study to another, to evaluate the overall consistency of whatever phenomenon is being studied. Following is a brief overview of some common numeric descriptors for data.

Frequency Tables

Often, a good first step in approaching a data set is to obtain a sense of the frequencies for demographic variables (in this example, gender and ethnicity). The frequency tables shown in Table 3.5 are designed to present the number and percentage of the sample that fall into each of a set of categories. As this pair of tables shows, the sample consisted of an equal number of men and women (i.e., 50% for each gender). The largest group of participants was White (45%), with the remainder divided almost equally among African-American (20%), Asian (15%), and Hispanic (20%) ethnicities.

Table 3.5: Frequency table summarizing ethnicity and sex distribution

Gender             Frequency   Percentage
Female             10          50.0
Male               10          50.0
Total              20          100.0

Ethnicity          Frequency   Percentage
African-American   4           20.0
Asian              3           15.0
Hispanic           4           20.0
White              9           45.0
Total              20          100.0
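A frequency table like the ethnicity panel of Table 3.5 can be built directly from the raw data. The following is a minimal sketch that uses the ethnicity column of Table 3.4; the function name `frequency_table` is a hypothetical helper introduced only for illustration.

```python
from collections import Counter

# Ethnicity column copied from Table 3.4, in subject-ID order.
ethnicity = [
    "White", "White", "Asian", "White", "Hispanic",
    "Hispanic", "White", "Hispanic", "Asian", "African-American",
    "White", "Asian", "White", "African-American", "White",
    "Hispanic", "White", "African-American", "White", "African-American",
]


def frequency_table(values):
    """Return {category: (count, percentage)}, sorted by category name."""
    counts = Counter(values)
    n = len(values)
    return {cat: (counts[cat], 100 * counts[cat] / n) for cat in sorted(counts)}


table = frequency_table(ethnicity)
for category, (count, pct) in table.items():
    print(f"{category:<18}{count:>3}{pct:>8.1f}")
```

The same function would work unchanged on the gender column, since a frequency count makes no assumptions about the scale of measurement.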

Researchers can gain a lot of information from numerical summaries of data. In fact, numeric descriptors form the starting point for doing inferential statistics and testing hypotheses. A statistics course explores these statistics in detail, but for now it is important to understand that two numeric descriptors can provide a wealth of information about a data set: measures of central tendency and measures of dispersion.

Measures of Central Tendency

The first number we need to describe our data is a measure of central tendency, which represents the most typical case in our data set. Central tendency is a single number that provides an overall sense of all the numbers. Think of what happens when colors are mixed: Adding yellow to blue creates green, so green gives us an overall sense of the combination of the two colors. In the same way, think of a household where one parent has a high salary, another has a moderate salary, and a teenager makes minimum wage. Taking the average of all three gives us an overall sense of the income for the entire household.

Central tendency can be represented by these three indices:

The mean is the mathematical average of a data set, calculated by adding up all the scores in the data set and then dividing this total by the number of scores in the data set. Because we are adding and dividing our scores, the mean can only be calculated using interval or ratio data (see Chapter 2 for a review of the four scales of measurement).

The median, another measure of central tendency, represents the number in the middle of a dataset, with half of the scores above it and half below it. The median is identified by placing the list of values in ascending numeric order, then selecting the number in the middle (or, when the dataset contains an even number of scores, averaging the two middle numbers). This measure of central tendency can be used for ordinal, interval, or ratio data because it requires only that the scores can be ranked.

The final measure of central tendency, the mode, represents the most frequent score in a data set and is obtained either by visual inspection of the values or by consulting a frequency table like the one in Table 3.5. Because the mode represents a simple frequency count, it can be used with any of the four scales of measurement. In addition, it is the only measure of central tendency that is valid for use with nominal data (that is, data that do not have a numerical value), since any numbers assigned to these data are arbitrary.

One important takeaway is that the scale of measurement largely dictates the choice between measures of central tendency: nominal scales can only use the mode, and only interval or ratio scales can use the mean. (For a review of these scales of measurement, see Chapter 2, Section 2.3.) The other piece of the puzzle is to consider which measure best represents the data. Remember that the central tendency is a way to represent the “typical” case using a single number, so the goal is to settle on the most representative number. The examples in Table 3.6 illustrate this process.

Table 3.6: Comparing the mean, median, and mode

Data                       Mean     Median   Mode
1, 2, 3, 4, 5, 11, 11      5.29     4        11
Discussion: Both the mean and the median seem to represent the data fairly well. The mean is a slightly better choice because it hints at the higher scores. The mode is not representative: two people seem to have higher scores than everyone else.

Data                       Mean     Median   Mode
1, 1, 1, 5, 10, 10, 100    18.29    5        1
Discussion: The mean is inflated by the atypical score of 100 and therefore does not represent the data accurately. The mode is also not representative because it ignores the higher values. In this case, the median is the most representative value to describe this dataset.
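The values in Table 3.6 can be checked with Python's standard `statistics` module. This is only a sketch for verifying the arithmetic; the dataset labels "first" and "second" are names chosen here for convenience.

```python
from statistics import mean, median, mode

# The two datasets from Table 3.6.
datasets = {
    "first": [1, 2, 3, 4, 5, 11, 11],
    "second": [1, 1, 1, 5, 10, 10, 100],
}

summary = {
    name: {
        "mean": round(mean(scores), 2),  # sum of scores / number of scores
        "median": median(scores),        # middle value of the sorted scores
        "mode": mode(scores),            # most frequent score
    }
    for name, scores in datasets.items()
}

for name, stats in summary.items():
    print(name, stats)
```

Comparing the three numbers side by side makes the effect of the single score of 100 easy to see: it moves the mean of the second dataset far above the median, while the median and mode are unaffected.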

Measures of Dispersion

The second measure used to describe a dataset is a measure of dispersion, or the spread of scores around the central tendency (also referred to as a measure of “variability”). Measures of dispersion tell us just how typical the typical score is. If the dispersion is low, then scores are clustered tightly around the central tendency; if dispersion is higher, then the scores stretch out farther from the central tendency. Figure 3.2 presents a conceptual illustration of dispersion. The graph on the left has a low amount of dispersion because the scores (i.e., the yellow curve) cluster tightly around the average value (i.e., the red dotted line). The graph on the right shows a high amount of dispersion because the scores (yellow curve) spread out widely from the average value (red dotted line). The graph on the right might represent the earlier example of household income: The average income represents all three family members, but between the high-earning parent and the minimum-wage-earning teenager is a fairly large spread.

Figure 3.2: Two distributions with a low versus high amount of dispersion

One of the most straightforward measures of dispersion is the range, which is the difference between the highest and lowest scores. In Table 3.6, the range of the first dataset would be found by simply subtracting the lowest value (1) from the highest value (11), to get a range of 10. The range is useful in giving a general idea of the spread of scores, although it does not say much about how tightly these scores cluster around the mean.

The most common measures of dispersion are the variance and standard deviation, both of which summarize how far individual scores typically fall from the mean. The variance is calculated by subtracting the mean from each score to obtain a deviation score, squaring and summing these individual deviation scores, and then dividing by the sample size. (Many statistics texts and software packages divide by the sample size minus 1 when the goal is to estimate the variance of a population from a sample.) The more scores are spread out around the mean, the higher the sum of these squared deviation scores will be, and therefore the higher the variance will be. Another common measure, the standard deviation (SD), is calculated as the square root of the variance, which returns the measure to the original units of the scores.
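The steps for the range, variance, and standard deviation can be sketched as follows, using the first dataset from Table 3.6. The hand calculation divides by the sample size, as described in the text; `statistics.pvariance` does the same, while `statistics.variance` divides by the sample size minus 1. Variable names are illustrative.

```python
from statistics import mean, pstdev, pvariance, variance

scores = [1, 2, 3, 4, 5, 11, 11]  # first dataset from Table 3.6

# Range: highest score minus lowest score.
score_range = max(scores) - min(scores)

# Variance by hand: deviation scores, squared, summed, divided by N.
m = mean(scores)
squared_deviations = [(x - m) ** 2 for x in scores]
variance_by_hand = sum(squared_deviations) / len(scores)

# Standard deviation: square root of the variance.
sd_by_hand = variance_by_hand ** 0.5

print("range:", score_range)
print("variance (divide by N):", variance_by_hand, pvariance(scores))
print("SD (divide by N):", sd_by_hand, pstdev(scores))
print("variance (divide by N - 1):", variance(scores))
```

The two variance conventions give slightly different answers for small samples, so it is worth checking which one a given textbook or software package reports.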

Once we know the central tendency and the dispersion of variables, we have a good sense of what the sample looks like. These numbers also provide a valuable part in calculating the inferential statistics that we ultimately use to test our hypotheses.

Standard Scores

So far we have discussed ways to describe a particular sample in numeric terms. What do we do when we want to compare results from different samples, or from studies using different scales? Say we want to compare the anxiety levels of two people; unfortunately, in this example, these people were measured using different anxiety scales:

Joe scored 25 on the ABC Anxiety Scale, which has a mean of 15 and a standard deviation of 2.

Deb scored 40 on the XYZ Anxiety Scale, which has a mean of 30 and a standard deviation of 10.

At first glance, Deb’s anxiety score appears higher, but note that the scales have different properties: The ABC scale has an average score of 15, while the XYZ scale has a higher average score of 30. The dispersion of these scales is also different; scores on the ABC scale cluster more tightly around the mean (i.e., the standard deviation is 2 compared to 10 on the XYZ scale).

The solution for comparing these scores is to convert both of them to standard scores (often expressed as z scores), which represent the distance of each score from the sample mean, expressed in standard deviation units. Standard scores let researchers translate raw scores into distributions with a predefined mean and standard deviation for easier interpretation. For example, scores on IQ tests are converted (i.e., standardized) onto a scale that has a mean of 100 and a standard deviation of 15. This tells us that a person with an IQ score of 100 is right at the average for the population, while someone with a score of 130 is two standard deviations above average.

The formula for a z score is worth examining in greater detail, as a way to understand the broader concept. Memorizing or using the formula in this research methods course is not required. The formula for a z score is:

z = (x – M)/SD

This formula subtracts the mean (M) from the individual score (x) and then divides this difference by the standard deviation of the sample (SD). To compare Joe’s score with Deb’s score, we simply substitute the appropriate numbers, using the mean and standard deviation from the scale that each one completed. This enables us to place scores from very different distributions on the same scale, making them easier to compare with one another. So, in this case:

Joe: z = (x – M)/SD = (25 – 15)/2 = 10/2 = 5

Deb: z = (x – M)/SD = (40 – 30)/10 = 10/10 = 1

The resulting scores represent each person’s score in standard deviation terms: Joe is 5 standard deviations above the mean of the ABC scale, while Deb is only 1 standard deviation above the mean of the XYZ scale. Or, in plain English, Joe is considerably more anxious than Deb.
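The z score calculation for Joe and Deb can be written as a small function. This is a minimal sketch of the formula z = (x – M)/SD; the function name `z_score` and the check for a nonpositive standard deviation are additions made here for illustration.

```python
def z_score(x, mean, sd):
    """Distance of score x from the mean, in standard deviation units."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (x - mean) / sd


# Joe: ABC Anxiety Scale (mean 15, SD 2); Deb: XYZ Anxiety Scale (mean 30, SD 10).
joe = z_score(25, mean=15, sd=2)
deb = z_score(40, mean=30, sd=10)

# The IQ example from the text: mean 100, SD 15.
iq_130 = z_score(130, mean=100, sd=15)

print("Joe:", joe)
print("Deb:", deb)
print("IQ of 130:", iq_130)
```

Because every result is expressed in standard deviation units, scores from the two anxiety scales and from the IQ scale can be read on a single common metric.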

To understand just how anxious Joe is, it is helpful to know a bit about why this technique works. Anyone who has taken a statistics class will have encountered the concept of the normal distribution (or “bell curve”), a symmetric distribution with an equal number of scores on either side of the mean, as Figure 3.3 illustrates.

It turns out that many variables in the social and behavioral sciences approximately fit this normal distribution, provided the sample sizes are large enough. A normal distribution is useful because it has a consistent set of properties, such as having the same value for mean, median, and mode. In addition, if the distribution is normal, each standard deviation cuts off a known percentage of the curve, as illustrated in Figure 3.3. That is, about 68% of scores will fall within ±1 standard deviation of the mean; about 95% of scores will fall within ±2 standard deviations; and about 99.7% of scores will fall within ±3 standard deviations.

Figure 3.3: Standard deviations and the normal distribution

These percentages allow us to understand individual data points in even more useful ways, because we can easily move back and forth between z scores, percentages, and standard deviations. Take the example of Joe and Deb’s anxiety scores: Deb has a z score of 1, which means her anxiety is 1 standard deviation above the mean. Furthermore, as we can see by consulting the normal distribution (Figure 3.3), her anxiety level is higher than 84% of the population. Joe has a z score of 5, which means his anxiety is 5 standard deviations above the mean. This also means that his anxiety is higher than more than 99.999% of the population. (A handy online calculator that converts between z scores and percentages is available at http://www.measuringusability.com/pcalcz.php)
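The conversion from a z score to a percentage of the population can also be done with Python's standard library, assuming the scores really are normally distributed. The sketch below uses `statistics.NormalDist` (available in Python 3.8 and later); the helper name `percent_below` is hypothetical.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)


def percent_below(z):
    """Percentage of a normal population scoring below the given z score."""
    return 100 * standard_normal.cdf(z)


deb_percent = percent_below(1)   # Deb, z = 1
joe_percent = percent_below(5)   # Joe, z = 5
iq_130_percent = percent_below(2)  # IQ of 130, z = 2

print(f"Deb is above {deb_percent:.1f}% of the population")
print(f"Joe is above {joe_percent:.5f}% of the population")
print(f"An IQ of 130 is above {iq_130_percent:.1f}% of the population")
```

The exact value for z = 2 is about 97.7%, which leaves roughly 2.3% above it; the figure of 2.5% that follows from the 95% rule of thumb is a convenient rounding.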

Discussions of intelligence test scores also commonly use the relationship between z scores and percentiles. Tests that purport to measure IQ are converted to a scale that has a mean of 100 and a standard deviation of 15. Because IQ scores are approximately normally distributed, we can move easily back and forth between z scores and percentages. For example, someone who has an IQ test score of 130 falls 2 standard deviations above the mean and falls in roughly the upper 2.5% of the population. A person with an IQ test score of 70 is 2 standard deviations below the mean and thus falls in roughly the bottom 2.5% of the population.

Ultimately, the use of standard scores allows us to take data that have been collected on different scales—perhaps in different laboratories and different countries—and place them on the same metric for comparison. As we have discussed in several contexts, science is all about the accumulation of knowledge one study at a time. The best support for an idea comes when data from different researchers, using different measures to capture the same concept, back the idea. The ability to convert these different measures back to the same metric is an invaluable tool for researchers who want to compare research results.

Visual Descriptions

Displaying data in visual form is often one of the most effective ways to communicate findings—as the cliché goes, a picture is worth a thousand words. What sort of visual should a researcher use? The choice of graphs is guided by two criteria: the scale of measurement and the best fit for the results. This section introduces some of the most common visual displays, based on hypothetical data used in Table 3.4.

Displaying Frequencies

One common type of graph is the bar graph, which also summarizes the frequency of data by category. Figure 3.4a depicts a bar graph, showing four categories of ethnicity along the horizontal axis and the number of people falling into each category indicated by the height of the bars. So, for example, this sample contains nine White participants and four Hispanic participants. Notice that this bar graph contains exactly the same information as the ethnicity frequency table in Table 3.5. When reporting results in a paper, a researcher would, of course, use only one of these methods. More often than not, graphical displays are the most effective way to communicate information.

Figure 3.4a: Bar graph displaying frequency by ethnicity

Figure 3.4b shows another variation on the bar graph, the clustered bar graph, which summarizes the frequency by two categories at one time. In this case, the bar graph displays information about both gender and ethnicity. As in the previous graph, categories of ethnicity are displayed along the horizontal axis. But this time, we have divided the total number of each ethnicity by the gender of respondents—indicated using different colored bars. For example, notice that the nine White participants are divided into five males and four females. Similarly, the four African-American participants are divided into one male and three females.

Figure 3.4b: Clustered bar graph displaying frequency by ethnicity and gender
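The counts behind a clustered bar graph are a cross-tabulation of two categorical variables. The sketch below tallies the gender and ethnicity columns of Table 3.4; it produces only the numbers that each bar would display, not the graph itself, and the variable names are illustrative.

```python
from collections import Counter

# (gender, ethnicity) pairs copied from Table 3.4, in subject-ID order.
participants = [
    ("Male", "White"), ("Male", "White"), ("Female", "Asian"),
    ("Male", "White"), ("Female", "Hispanic"), ("Female", "Hispanic"),
    ("Female", "White"), ("Male", "Hispanic"), ("Male", "Asian"),
    ("Female", "African-American"), ("Male", "White"), ("Male", "Asian"),
    ("Male", "White"), ("Female", "African-American"), ("Female", "White"),
    ("Male", "Hispanic"), ("Female", "White"), ("Male", "African-American"),
    ("Female", "White"), ("Female", "African-American"),
]

# Each (gender, ethnicity) combination corresponds to one bar in the cluster.
cell_counts = Counter(participants)

ethnicities = sorted({ethnicity for _, ethnicity in participants})
for ethnicity in ethnicities:
    males = cell_counts[("Male", ethnicity)]
    females = cell_counts[("Female", ethnicity)]
    print(f"{ethnicity:<18}Male: {males}  Female: {females}")
```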

Keep in mind that bar graphs are used for qualitative, or nominal, categories. We could just as easily have listed White participants second, third, or fourth along the axis because ethnicity is measured on a nominal scale, so the order of the categories carries no meaning.

When we want to present quantitative data (that is, values measured on an ordinal, interval, or ratio scale), we use a different kind of graph called a histogram. As Figure 3.5a shows, histograms are drawn with the bars touching one another to indicate that the categories are quantitative and on a continuous scale. This figure has broken down the “life-satisfaction” values into three categories (less than 30, 31–40, and 41–50) and displayed the frequencies for each category in numerical order. For example, in the Table 3.4 data, ten people had life satisfaction scores falling between 31 and 40.
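Grouping quantitative scores into ordered categories, as a histogram does, can be sketched with the life satisfaction column of Table 3.4. The inclusive bin edges below are an assumption made for this illustration; in particular, a score of exactly 30 is placed in the lowest bin, and the published figure may draw its boundaries differently.

```python
# Life satisfaction column copied from Table 3.4, in subject-ID order.
life_satisfaction = [40, 47, 29, 32, 25, 35, 28, 40, 37, 30,
                     43, 40, 48, 30, 37, 40, 36, 45, 42, 38]

# Assumed bins as (label, lowest score, highest score), both ends inclusive.
bins = [("30 or below", 0, 30), ("31-40", 31, 40), ("41-50", 41, 50)]


def bin_counts(scores, bins):
    """Count how many scores fall into each ordered bin."""
    counts = {label: 0 for label, _, _ in bins}
    for score in scores:
        for label, low, high in bins:
            if low <= score <= high:
                counts[label] += 1
                break
    return counts


counts = bin_counts(life_satisfaction, bins)
for label, n in counts.items():
    # A row of '#' characters stands in for the height of each bar.
    print(f"{label:<12}{'#' * n} ({n})")
```

Unlike the categories in a bar graph, these bins have a natural order, which is why they must be displayed from lowest to highest.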

Finally, all of our bar graphs and histograms so far have displayed data that have been split into categories. However, as Figure 3.5b illustrates, histograms can also present data on a continuous scale. Figure 3.5b also has an additional new feature—a curved line overlaid on the graph. This curve represents a normal distribution and allows us to gauge visually how close our sample data are to being normally distributed.

Figure 3.5a: Histogram showing frequencies by life satisfaction (quantitative) categories

Figure 3.5b: Histogram showing life satisfaction scores on a continuous scale

Displaying Central Tendency

Graphs are also commonly used to display numeric descriptors in an easy-to-understand visual format. The sample data in Table 3.4 provide information not only about ethnicity and gender but also about reports of daily stress and life satisfaction. Thus, a natural question is whether there are gender or ethnic differences in these two variables. Figure 3.6 displays a clustered bar graph showing the mean level of life satisfaction in each group of participants. Of note is that males appear to report more life satisfaction than females, as revealed by the fact that the red bars are always higher than the gold bars. We can also see some variation in satisfaction levels by ethnicity: African-American males (45) appear to report slightly more satisfaction than White males (42).

Figure 3.6: Clustered bar graph displaying life satisfaction scores by gender and ethnicity
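The group means that a graph like Figure 3.6 displays can be computed from the raw data in Table 3.4. This is a minimal sketch using only the standard library; the variable names are illustrative, and the output is the set of means rather than the graph itself.

```python
from collections import defaultdict
from statistics import mean

# (gender, ethnicity, life satisfaction) copied from Table 3.4, in subject-ID order.
rows = [
    ("Male", "White", 40), ("Male", "White", 47), ("Female", "Asian", 29),
    ("Male", "White", 32), ("Female", "Hispanic", 25), ("Female", "Hispanic", 35),
    ("Female", "White", 28), ("Male", "Hispanic", 40), ("Male", "Asian", 37),
    ("Female", "African-American", 30), ("Male", "White", 43), ("Male", "Asian", 40),
    ("Male", "White", 48), ("Female", "African-American", 30), ("Female", "White", 37),
    ("Male", "Hispanic", 40), ("Female", "White", 36), ("Male", "African-American", 45),
    ("Female", "White", 42), ("Female", "African-American", 38),
]

# Collect the scores for each ethnicity-by-gender group.
scores_by_group = defaultdict(list)
for gender, ethnicity, score in rows:
    scores_by_group[(ethnicity, gender)].append(score)

# One mean per group corresponds to one bar in the clustered bar graph.
group_means = {group: mean(scores) for group, scores in scores_by_group.items()}

for (ethnicity, gender), value in sorted(group_means.items()):
    n = len(scores_by_group[(ethnicity, gender)])
    print(f"{ethnicity:<18}{gender:<8}mean = {value:.2f} (n = {n})")
```

Printing the group sizes alongside the means is a useful reminder that some of these bars rest on a single participant, which is one more reason to interpret the graph cautiously.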

These particular data are fictional, of course, but even if our graph depicted real data, we would want to be cautious in interpreting them. One reason for caution is that the data represent a descriptive study. We might be able to state which demographic groups report more life satisfaction, but we would be unable to determine the reasons for the difference. Another, more important, reason for caution is that visual presentations can be misleading, and we would need to conduct statistical analyses to discover the real patterns of differences.

The best way to appreciate this latter point is to notice what happens when we tweak the graph a little bit. The original graph in Figure 3.6 is a fair representation of the data: The scale starts at zero, and the y-axis on the left side increases by reasonable intervals. However, if we were trying to win an argument about gender differences in happiness, we could always alter the scale, as Figure 3.7 shows. These bars represent the same set of means, but we have compressed the y-axis to show only a small part of the range of the scale. That is, rather than ranging from 0 to 50, this misleading graph ranges from 28 to 45, in increments of 1. To the uncritical eye, the graph appears to show an enormous gender difference in life satisfaction; to the trained eye, it shows an obvious attempt to make the findings seem more interesting. Anytime we encounter a bar graph used to support a particular argument, we must always pay close attention to the scale of the results: Does it represent the actual range of the data, or is it compressed to exaggerate the difference? Likewise, any time researchers create a graph to display results, they have a responsibility to ensure that the graph is an accurate representation of the data.

Figure 3.7: Clustered bar graph altered to exaggerate the differences

4.3 Sampling From the Population

At this point, the chapter should have conveyed an understanding of how to construct survey items. The next step is to find a group of people to fill out the survey. But where does a researcher find this group? And how many people are needed? On the one hand, researchers want as many people as possible to capture the full range of attitudes and experiences. On the other hand, they have to conserve time and other resources, which often means choosing a smaller sample of people. This section examines the strategies researchers can use to select samples for their studies.

Researchers refer to the entire collection of people who could possibly be relevant to a study as the population. For example, if we were interested in the effects of prison overcrowding, our population would consist of prisoners in the United States. If we wanted to study voting behavior in the next presidential election, the population would be U.S. residents eligible to vote. And if we wanted to know how well college students cope with the transition from high school, our population would include every college student enrolled in every college in the country.

These populations suggest an obvious practical complication. How can we get every college student—much less every prisoner—in the country to fill out our questionnaire? We cannot; instead, researchers will collect data from a sample, a subset of the population. Instead of trying to reach all prisoners, we might sample inmates from a handful of state prisons. Rather than attempt to survey all college students in the country, researchers often restrict their studies to a collection of students at one university.

The goal in choosing a sample is to make it as representative as possible of the larger population. That is, if researchers choose students at one university, they need to be reasonably similar to college students elsewhere in the country. If the phrase “reasonably similar” sounds vague, this is because the basis for evaluating a sample varies depending on the hypothesis and the key variables. For example, if we wanted to study the relationship between family income and stress levels, we would need to make sure that our sample mirrored the population in the distribution of income levels. Thus, a sample of students from a state university might be a better choice than students from, say, Harvard (which costs about $60,000 per year including room and board). On the other hand, if the research question deals with the pressures faced by students in selective private schools, then Harvard students could be a representative sample for the study.

Figure 4.1 shows a conceptual illustration of both a representative and nonrepresentative sample, drawn from a larger population. The population in this case consists of 144 individuals, split evenly between Xs and Os. Thus, we would want our sample to come as close as possible to capturing this 50/50 split. The sample of 20 individuals on the left is representative of the population because it is split evenly between Xs and Os. But the sample of 20 individuals on the right is nonrepresentative because it contains 75% Xs. Because this sample has far fewer Os than we would expect from the population, it does not accurately represent the population. This failure of the sample to represent the population is also referred to as sampling bias.

Figure 4.1: Representative and nonrepresentative samples of a population

Where do these samples come from? Researchers have two broad categories of sampling strategies at their disposal: probability sampling and nonprobability sampling.

Probability Sampling

Researchers use probability sampling when each person in the population has a known chance of being in the sample. This is possible only in cases where researchers know the exact size of the population. For instance, the current population of the United States is 322.1 million people (http://www.census.gov/popclock/). If we were to select a U.S. resident at random, each resident would have a one in 322.1 million chance of being selected. Whenever researchers have this information, probability-sampling strategies are the most powerful approach because they greatly increase the odds of getting a representative sample. Within this broad category of probability sampling are three specific strategies: simple random sampling, stratified random sampling, and cluster sampling.

Simple random sampling, the most straightforward approach, involves randomly picking study participants from a list of everyone in the population. The term for this list is a sampling frame (e.g., imagine a list of every resident of the United States). To have a truly representative random sample, researchers must have a sampling frame; they must choose from it randomly; and they must have a 100% response rate from those selected. (As Chapter 2 discussed, if people drop out of a study, it can threaten the validity of the hypothesis test.)
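As a rough illustration of simple random sampling, the sketch below draws participants from a sampling frame so that every member has the same chance of selection. The frame of 10,000 resident IDs, the function name, and the sample size of 100 are all hypothetical and chosen only for the example.

```python
import random

def simple_random_sample(sampling_frame, n, seed=None):
    """Draw n distinct members from the sampling frame, each equally likely."""
    rng = random.Random(seed)  # a seed makes the draw reproducible
    return rng.sample(sampling_frame, n)

# Hypothetical sampling frame: ID strings standing in for a list of residents.
frame = [f"resident_{i}" for i in range(10_000)]
sample = simple_random_sample(frame, 100, seed=42)
```

Note that the code covers only the first two requirements named above (having a frame and choosing from it randomly). The third requirement, a 100% response rate, depends on the people selected and cannot be guaranteed by the selection procedure.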

bowdenimages/iStock/Thinkstock

In a neighborhood with a majority of Caucasian residents, stratified random sampling is needed to capture the perspective of all ethnic groups in the community.

Researchers use stratified random sampling, a variation of simple random sampling, when subgroups of the population might be left out of, or poorly represented by, a purely random sampling process. Imagine a city with a population that is 80% Caucasian, 10% Hispanic, 5% African American, and 5% Asian. If we were to pick 100 residents at random, the chances are very good that our sample would consist overwhelmingly of Caucasian residents and would include too few ethnic minority residents (on average, only about five African American and five Asian residents) to capture their perspectives. To prevent this problem, researchers use stratified random sampling: breaking the sampling frame into subgroups and then randomly sampling from each subgroup. In this example, we could divide the list of residents into four ethnic groups and then pick a random 25 from each of these groups. The end result would be a sample of 100 people that captured opinions from each ethnic group in the population. Notice that this approach results in a sample that does not exactly represent the underlying population; that is, Hispanics constitute 25% of the sample, rather than 10%. One way to correct for this issue is to use a statistical technique known as “weighting” the data. Although the full details are beyond the scope of this book, weighting involves trying to correct for problems in representation by assigning each participant a weighting coefficient for analyses. In essence, people from groups that are underrepresented would have a weight greater than 1, while those from groups that are overrepresented would have a weight less than 1. For more information on weighting and its uses, see http://www.applied-survey-methods.com/weight.html.
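The stratified example can be sketched in a few lines of code. The population shares and the 25-per-group sample come from the example above; the city size of 10,000, the function names, and the weighting rule (population share divided by sample share) are assumptions made for illustration, since the full details of weighting are beyond the scope of this book.

```python
import random

# Population shares from the example city.
population_share = {
    "Caucasian": 0.80,
    "Hispanic": 0.10,
    "African American": 0.05,
    "Asian": 0.05,
}

def stratified_sample(frame_by_group, n_per_group, seed=None):
    """Randomly pick the same number of people from each subgroup."""
    rng = random.Random(seed)
    return {group: rng.sample(members, n_per_group)
            for group, members in frame_by_group.items()}

def weighting_coefficients(population_share, sample):
    """One simple weighting rule (an assumption): population share / sample share."""
    total = sum(len(members) for members in sample.values())
    return {group: population_share[group] / (len(sample[group]) / total)
            for group in sample}

# Hypothetical sampling frame for a city of 10,000 residents.
frame_by_group = {
    group: [f"{group}_{i}" for i in range(round(share * 10_000))]
    for group, share in population_share.items()
}
sample = stratified_sample(frame_by_group, n_per_group=25, seed=1)
weights = weighting_coefficients(population_share, sample)
```

Under this rule, Caucasian residents (80% of the city but 25% of the sample) receive a weight of about 3.2, Hispanic residents about 0.4, and African American and Asian residents about 0.2 each, which matches the idea that underrepresented groups get weights above 1 and overrepresented groups get weights below 1.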

Finally, researchers employ cluster sampling, another variation of random sampling, when they do not have access to a full sampling frame (i.e., a full list of everyone in the population). Imagine that we want to study how cancer patients in the United States cope with their illness. Because no list exists of every cancer patient in the country, we have to get a little creative with our sampling. The best way to think about cluster sampling is as “samples within samples.” Just as with stratified sampling, we divide the overall population into groups, but cluster sampling differs in that we are dividing into groups based on more than one level of analysis. In our cancer example, we could start by dividing the country into regions, then randomly selecting cities from within each region, and then randomly selecting hospitals from within each city, and finally randomly selecting cancer patients from each hospital. The end result would be a random sample of cancer patients from, say, Phoenix, Miami, Dallas, Cleveland, Albany, and Seattle; taken together, these patients would provide a fairly representative sample of cancer patients around the country.
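The “samples within samples” idea can be sketched as nested random draws. Everything in the example below is hypothetical: the nested dictionary stands in for regions, cities, hospitals, and patient lists, and the number selected at each level is arbitrary. The point is only that a full list of patients is never needed; each stage requires a list of the units at that level alone.

```python
import random

def cluster_sample(regions, cities_per_region, hospitals_per_city,
                   patients_per_hospital, seed=None):
    """Multistage draw: cities within regions, hospitals within cities, patients within hospitals."""
    rng = random.Random(seed)
    selected = []
    for cities in regions.values():
        for city in rng.sample(sorted(cities), cities_per_region):
            hospitals = cities[city]
            for hospital in rng.sample(sorted(hospitals), hospitals_per_city):
                patients = hospitals[hospital]
                selected.extend(rng.sample(patients, patients_per_hospital))
    return selected

# Hypothetical structure: 3 regions, 5 cities each, 4 hospitals each, 30 patients each.
regions = {
    f"region_{r}": {
        f"city_{r}_{c}": {
            f"hospital_{r}_{c}_{h}": [f"patient_{r}_{c}_{h}_{p}" for p in range(30)]
            for h in range(4)
        }
        for c in range(5)
    }
    for r in range(3)
}
patients = cluster_sample(regions, cities_per_region=2, hospitals_per_city=2,
                          patients_per_hospital=5, seed=1)
```

With these settings the sample contains 3 × 2 × 2 × 5 = 60 patients.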

Nonprobability Sampling

The other broad category of sampling strategies is known as nonprobability sampling. These strategies are used in the (remarkably common) case in which researchers do not know the odds of any given individual’s being in the sample. This uncertainty represents an obvious shortcoming—if we do not know the exact size of the population and do not have a list of everyone in it, we have no way to know that our sample is representative. Despite this limitation, researchers use nonprobability sampling on a regular basis. We will discuss two of the most common nonprobability strategies here.

In many cases, it is not possible to obtain a sampling frame. When researchers study rare or hard-to-reach populations or study potentially stigmatizing conditions, they often recruit by word-of-mouth. The term for this is snowball sampling: imagine a snowball rolling down a hill, picking up more snow (or participants) as it goes. If we wanted to study how often homeless people took advantage of social services, we would be hard pressed to find a sampling frame that listed the homeless population. Instead, we could recruit a small group of homeless people and ask each of them to pass the word along to others, and so on. If we wanted to study changes in people’s identities following sex-reassignment surgery, we would find it difficult to track down this population via public records. Instead, we could recruit one or two patients and ask for referrals to others. The resulting sample in both cases is unlikely to be representative, but researchers often have to compromise for the sake of obtaining access to a population. Snowball sampling is most often used in qualitative research, where the advantages of gaining a rich narrative from these individuals outweigh the loss of representativeness.

One of the most popular nonprobability strategies is known as convenience sampling, or simply including people who show up for the study. Any time a 24-hour news station announces the results of a viewer poll, those results are likely based on a convenience sample. CNN and Fox News do not randomly select from a list of their viewers; they post a question onscreen or online, and people who are motivated (or bored) enough to respond will do so. As a matter of fact, the vast majority of psychology research studies are based on convenience samples of undergraduate college students. Research in psychology departments often works like this: Experimenters advertise their studies on a website, and students enroll in these studies, either to earn extra cash or to fulfill a research requirement for a course. Students often pick a particular study based on whether it fits their busy schedules or whether the advertisement sounds interesting. These decisions are hardly random and, consequently, neither is the sample. The goal here is not to disparage all psychology research (that would be self-defeating) but to emphasize that all of the decisions researchers make have both pros and cons.

Choosing a Sampling Strategy

Although researchers always strive for a representative sample, no such thing as a perfectly representative one exists. Some degree of sampling error, defined as the degree to which the characteristics of the sample differ from the characteristics of the population, is always present. Instead of aiming for perfection, then, researchers aim for an estimate of how far from perfection their samples are. These estimates are known as the margin of error, or the degree to which the results from a particular sample are expected to deviate from the population as a whole.

One of the main advantages of a probability sample is that we are able to calculate these errors, as long as we know our sample size and desired level of confidence. In fact, most of us encounter margins of error every time we see the results of an opinion poll. For example, CNN may report that “Candidate A is leading the race with 60% of the vote, ± 3%.” This means Candidate A’s percentage in the sample is 60%, but based on statistical calculations, her real percentage in the population is very likely between 57% and 63%. The smaller the error (3% in this example), the more closely the results from the sample match the population. Naturally, researchers conducting these opinion polls want the error of estimation to be as small as possible. How persuaded would anyone be to learn that “Candidate A has a 10-point lead, plus or minus 20 points”? This margin of error ought to trigger our skepticism, because the real difference could be anywhere between 30 points and –10 points, that is, a 10-point lead for the other candidate.

Researchers’ most direct means of controlling the margin of error is by changing the sample size. Most survey research aims for a margin of error of less than five percentage points. Based on standard calculations, this requires a sample size of 400 people per group. That is, if we want to draw conclusions about the entire sample (e.g., “30% of registered voters said X”), then we would need at least 400 respondents to say this with some confidence. If we want to draw conclusions about subgroups (e.g., “30% of women compared to 50% of men”), then we would actually need at least 400 respondents of each gender to draw conclusions with confidence.

The magic number of 400 represents a compromise—a researcher is willing to accept 5% error for the sake of keeping time and costs down. It is worth noting, however, that some types of research have more stringent standards: For political polls to be reported by the media, they must have at least 1,000 respondents, which brings the margin of error down to three percentage points. In contrast, some areas of applied research may have more relaxed standards. In marketing research, for example, budget considerations sometimes lead to smaller samples, which means drawing conclusions at lower levels of confidence. For example, with a sample size of 100 people per group, researchers have to contend with 8–10% margin of error—almost double the error, but at a fraction of the costs.
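The sample sizes quoted above can be checked with a short calculation. The text refers only to “standard calculations,” so the formula below is an assumption: the usual normal approximation for a proportion at the 95% confidence level (z = 1.96), using the worst case of a 50/50 split (p = 0.5). The function names are hypothetical.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate margin of error for a proportion p estimated from n respondents."""
    return z * math.sqrt(p * (1 - p) / n)

def required_sample_size(moe, p=0.5, z=1.96):
    """Smallest n whose margin of error does not exceed moe."""
    return math.ceil(z ** 2 * p * (1 - p) / moe ** 2)

for n in (100, 400, 1000):
    print(n, round(margin_of_error(n) * 100, 1))  # margin of error in percentage points
```

Under these assumptions, 400 respondents give a margin of error of about 4.9 percentage points, 1,000 respondents about 3.1, and 100 respondents about 9.8, in line with the figures in the text. Solving in the other direction, a margin of exactly five points requires 385 respondents, which is commonly rounded up to 400.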

If probability sampling is so powerful, why are nonprobability strategies so popular? One reason is that convenience samples are more practical; they are cheaper, easier, and almost always possible to conduct with relatively few resources because researchers can avoid the costs of large-scale sampling. A second reason is that convenience is often a good-enough starting point for a new line of research. For example, if we wanted to study the predictors of relationship satisfaction, we could start by testing our hypotheses in a controlled setting using college student participants and then extend the research to the study of adult married couples. Finally, and relatedly, in many cases it is acceptable to have a nonrepresentative sample because researchers do not need to generalize results. If we want to study the prevalence of alcohol use in college students, it may be perfectly acceptable to use a convenience sample of college students. Even in this case, however, researchers would have to keep in mind that they are studying drinking behaviors among students who volunteered to complete a study on drinking behaviors.

In some cases, however, it is critical to use probability sampling, despite the extra effort required. Specifically, researchers use probability samples any time it is important to generalize and any time it is important to predict behavior of a population. The best way of understanding these criteria is to think of political polls. In the lead-up to an election, each campaign is invested in knowing exactly what the voting public thinks of its candidate. In contrast to a CNN poll, which is based on a convenience sample of viewers, polls conducted by a campaign will be based on randomly selected households from a list of registered voters. The resulting sample is much more likely to be representative, much more likely to tell the campaign how the entire population views its candidate, and therefore much more likely to be useful.

5 Experimental Designs: Explaining Behavior

Antonio Oquias/Hemera/Thinkstock

Learning Outcomes

By the end of this chapter, you should be able to:

Use appropriate terminology when discussing experimental designs.
Identify the key features of experiments for making causal statements.
Explain the importance of both internal and external validity in experiments.
Describe the threats to both internal and external validity in experiments.
Outline the most common types of experimental designs.
Describe methods for analyzing experimental data.
Summarize methods for avoiding Type I and Type II error.

One of the oldest debates within psychology concerns the relative contributions of biology and the environment in shaping our thoughts, feelings, and behaviors. Do we become who we are because it is hard-wired into our DNA, or because of our early experiences? Do people share their parents’ personality quirks because they carry their parents’ genes, or because they grew up in their parents’ homes? Researchers can, in fact, address these types of questions in several ways. A consortium of researchers at the University of Minnesota has spent the past three decades comparing pairs of identical and fraternal twins, raised in the same versus different households, to tease apart the contributions of genes and environment. Read more at the research group’s website, http://mctfr.psych.umn.edu/.

An alternative way to separate genetic and environmental influence is through the use of experimental designs, which have the primary goal of explaining the causes of behavior. Recall from the design overview in Chapter 2 (2.1) that experiments can address causal relationships because the experimenter has control over the environment as well as over the manipulation of variables. One particularly ingenious example comes from the laboratory of Michael Meaney, a professor of psychiatry and neurology at McGill University. Meaney used female rats as experimental subjects (Francis, Diorio, Liu, & Meaney, 1999). His earlier research had revealed that the parenting ability of female rats could be reliably classified based on how attentive they were to their rat pups, as well as how much time they spent grooming the pups. The question tackled in the 1999 study was whether these behaviors were learned from the rats’ own mothers or transmitted genetically. To answer this question experimentally, Meaney and colleagues had to think very carefully about the comparisons they wanted to make. To simply compare the offspring of good and bad mothers would have been insufficient, because this approach could not distinguish between genetic and environmental pathways.

Instead, Meaney decided to use a technique called cross-fostering, or switching rat pups from one mother to another as soon as they were born. The technique resulted in four combinations of rats: (1) those born to inattentive mothers but raised by attentive ones, (2) those born to attentive mothers but raised by inattentive ones, (3) those born and raised by attentive mothers, and (4) those born and raised by inattentive mothers. Meaney then tested the rat pups several months later and observed the way they behaved with their own offspring. Meaney’s control over all aspects of how the rat pups were raised was a critical element; he was able to keep everything the same except for the combination of their genetics and rearing environment. The setup of this experiment allowed Meaney to make clear comparisons between the influence of birth mothers and the rearing process. At the end of the study, the conclusion was crystal clear: Maternal behavior is all about the environment. Those rat pups that ultimately grew up to be inattentive mothers were those who had been raised by inattentive mothers.

This final chapter is dedicated to experimental designs, in which the primary goal is to explain behavior. Experimental designs rank highest on the continuum of control (see Figure 5.1) because the experimenter can manipulate variables, minimize extraneous variables, and assign participants to conditions. The chapter begins with an overview of the key features of experiments and then explains the importance of both internal and external validity of experiments. From there, the discussion moves to the process of designing and interpreting experiments and concludes with a summary of strategies for minimizing error in experiments.

Figure 5.1: Experimental designs on the continuum of control

5.1 Experiment Terminology

Before we dive into the details, it is important to cover the terminology that the chapter will use to describe different aspects of experimental designs. Much of this will be familiar from Chapter 2, with a few new additions. First, we will review the basics.

Recall that a variable is any factor that has more than one value. For example, height is a variable because people can be short, tall, or anywhere in between. Depression is a variable because people can experience a wide range of symptoms, from mild to severe. The independent variable (IV) is the variable that is manipulated by the experimenter to test hypotheses about cause. The dependent variable (DV) is the variable that is measured by the experimenter to assess the effects of the independent variable. For example, in an experiment testing the hypothesis that fear causes prejudice, fear would be the independent variable and prejudice would be the dependent variable. To keep these terms straight, it is helpful to think of the main goal of experimental designs. That is, we test hypotheses about cause by manipulating an independent variable and then looking for changes in a dependent variable. This means that we think the independent variable causes changes in the dependent variable; for example, we hypothesize that fear causes changes in prejudice.

When we manipulate an independent variable, we will always have two or more versions of the variable; this is what distinguishes experiments from, say, structured observational studies. One common way to describe the versions of the IV is in terms of different groups, or conditions. The most basic experiments have two conditions: The experimental condition receives a treatment designed to test the hypothesis, while the control condition does not receive this treatment. In the fear and prejudice example above, the participants who make up the experimental condition would be made to feel afraid, while the participants who make up the control condition would not. This setup allows us to test whether introducing fear to one group of participants leads them to express more prejudice than the other group of participants, who are not made fearful.

Another common way to describe these versions is in terms of levels of the independent variable. Levels describe the specific set of circumstances created by manipulating a variable. For example, in the fear and prejudice experiment, the variable of fear would have two levels: afraid and not afraid. We have countless ways to operationalize fear in this experiment. One option would be to adopt the technique used by the social psychologist Stanley Schachter (1959), who led participants to believe they would be exposed to a series of painful electric shocks. In Schachter’s study, the painful shocks never happened, but the expectation of them did induce a fearful state. So, those at the “afraid” level of the independent variable might be told to expect these shocks, while those at the “not afraid” level of the independent variable would not be given this expectation.

At this stage, having two sets of vocabulary terms (“levels” and “conditions”) for the same concept may seem odd. However, with advanced experimental designs using multiple independent variables, there is a subtle difference in how these terms are used. As the designs become more complex, it is often necessary to expand IVs to include several groups and multiple variables. At that point, researchers need different terminology to distinguish between the versions of one variable and the combinations of multiple variables. The chapter will later return to this complexity, in the section “Experimental Design.”

5.2 Key Features of Experiments

The overview of designs in Chapter 2 described the overall process of experiments in the following way: Researchers control the environment as much as possible so that all participants have the same experience. The researchers then manipulate, or change, one key variable, and then measure the outcomes in another key variable. This section examines this process in more detail. Experiments can be distinguished from all other designs by three key features: manipulating variables, controlling the environment, and assigning people to groups.

Manipulating Variables

The most crucial element of an experiment is the researcher’s manipulation, or change, of some key variable. To study the effects of hunger, for example, a researcher could manipulate the amount of food given to the participants, or to study the effects of temperature, the experimenter could raise and lower the temperature of the thermostat in the laboratory. In both cases, recall that the researcher needs a way to operationalize the concepts (hunger and temperature) into measurable variables. For example, the experimenter could define “hungry” as being deprived of food for eight hours, and define a “hot” room as being 90 degrees Fahrenheit. Because these factors are under the direct control of the experimenters, they can feel more confident that changing them contributes to changes in the dependent variables.

Chapter 2 discussed the main shortcoming of correlational research: These designs do not allow researchers to make causal statements. Recall from that chapter (as well as from Chapter 4) that correlational research is designed to predict one variable from another. One of the examples in Chapter 2 concerned the correlation between income levels and happiness, with the goal of trying to predict happiness levels based on knowing people’s income level. If we measure these as they occur in the real world, we cannot say for sure which variable causes the other. However, we could settle this question relatively quickly with the right experiment. Suppose we bring two groups into the laboratory and give one group $100 and a second group nothing. If the first group is happier at the end of the study, it would support the idea that money really does buy happiness. Of course, this experiment is a rather simplistic look at the connection between money and happiness. Even so, because we manipulate levels of money, this study would bring us closer to making causal statements about the effects of money.

To manipulate variables, it is necessary to have at least two versions of the variable. That is, to study the effects of money, we need a comparison group that does not receive money. To study the effects of hunger, we would need both a hungry and a not-hungry group. Having two versions of the variable distinguishes experimental designs from the structured observations discussed in Chapter 3 (3.4), in which all participants receive the same set of conditions in the laboratory. Even the most basic experiment must have two sets of conditions, which are often an experimental group and a control group. However, as this chapter will later explain, experiments can become much more complex. A study might have one experimental group and two control groups, or five degrees of food deprivation, ranging from 0 to 12 hours without food. Decisions about the number and nature of these groups will depend on consideration of both the hypotheses and previous literature.

Monkey Business Images/Monkey Business/Thinkstock

Having a patient run on a treadmill to measure cardiovascular stress is an example of invasive manipulation.

Researchers have three options for manipulating variables. First, environmental manipulations involve changing some aspect of the setting. Environmental manipulations are perhaps the most common in psychology studies, and they include everything from varying the room temperature to varying the amount of money people receive. The key is to change the way that different groups of people experience their time in the laboratory—it is either hot or cold, and they either receive or do not receive $100.

Second, instructional manipulations involve changing the way a task is described to change participants’ mindsets. For example, a researcher might give the same math test to all participants but to one group, describe it as an “intelligence test” and to another group, a “problem-solving task.” Because an intelligence test is thought to have implications for life success, the experimenter might expect participants in that group to be more nervous about their scores.

Finally, an invasive manipulation involves taking measures to change internal, physiological processes; it is usually conducted in medical settings. For example, studies of new drugs involve administering the drug to volunteers to determine whether it has an effect on some physical or psychological symptom. Alternatively, studies of cardiovascular health often involve having participants run on a treadmill to measure how the heart functions under stress.

The rule that we must manipulate a variable has one qualification. In many experiments, researchers divide participants based on a preexisting difference (e.g., gender) or personality measures (e.g., self-esteem or neuroticism) that capture stable individual differences among people. The idea behind these personality measures is that someone scoring high on a measure of neuroticism (for example) would be expected to be more neurotic across situations than someone scoring lower on the measure. Using this technique allows a researcher to compare how, for example, men and women or people with high and low self-esteem respond to manipulations.

When researchers use preexisting differences in an experimental context, these differences are referred to as quasi-independent variables. They are “quasi,” or “nearly,” independent because they are being measured, not manipulated, by the experimenter, and thus do not meet the criteria for a regular independent variable. In fact, variables used in this way are things that cannot be manipulated by an experimenter, either for practical or ethical reasons, including gender, race, age, eye color, religion, and so forth. Instead, these are treated as independent variables in that participants are divided into groups along these variables (e.g., male versus female; Catholic versus Protestant versus Muslim).

Because these variables are not manipulated, an experimenter cannot make causal statements about them. For a study to count as an experiment, these quasi-independent variables would have to be combined with a true independent variable. This could be as simple as comparing how men and women respond to a new antidepressant drug: gender would be quasi-independent while drug type would be a true independent variable.

Sometimes the line between true and quasi-experiments can be subtle. Imagine we want to study the effects on people’s persistence at a second task based on winning versus losing a contest. In a quasi-experimental approach, we could have two participants play a game, resulting in a natural winner and loser, and then compare how long each one stuck with the next game. The approach’s limitation is that some preexisting condition might have affected winning and losing the first game. Perhaps the winners had more self-confidence and patience at the start. However, we could improve the design to be a true experiment by having participants play a rigged game against a confederate, thereby causing participants either to win or lose. In this case, we would be manipulating winning and losing, and preexisting differences would be averaged out across the groups (more on this later in the chapter).

Controlling the Environment

The second important element of experimental designs is the researcher’s high degree of control over the environment. In addition to manipulating variables, an experimenter has to ensure that the other aspects of the environment are the same for all participants. For instance, if we were interested in the effects of temperature on people’s mood, we could manipulate temperature levels in the laboratory so that some people experienced warmer temperatures and other people cooler temperatures. However, it is equally important to make sure that other potential influences on mood are the same for both groups. That is, we would want to make sure that the “warm” and “cool” groups were tested in the same room, around the same time of day, and by similar experimenters.

The overall goal, then, is to control extraneous variables, or variables that add noise to the hypothesis test. In essence, the more researchers can control extraneous variables, the more confidence they can have in the results of the hypothesis test. As the section “Validity and Control” will discuss, these extraneous variables can have different degrees of impact on a study. Imagine we conduct the study on temperature and mood, and all of our participants are in a windowless room with a flickering fluorescent light. This environment would likely influence people’s mood (making everyone a little bit grumpy), but it causes fewer problems for our hypothesis test because it affects everyone equally. Table 5.1 shows hypothetical data from two variations of this study, using a 10-point scale to measure mood ratings. In the top row, participants were in a well-lit room; notice that participants in the cooler room reported being in a better mood (i.e., an 8 versus a 5). In the bottom row, all participants were in the windowless room with flickering lights. These numbers suggest that people were still in a better mood in the cool room (5) than in the warm room (2), but the flickering fluorescent light had a constant dampening effect on everyone’s mood.

Table 5.1: Influence of an extraneous variable

                                      Cool Room    Warm Room
Variation 1: Well-Lit                     8            5
Variation 2: Flickering Fluorescent       5            2
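The arithmetic behind Table 5.1 can be checked with a short sketch. The ratings below are copied from the table; the dictionary layout and the function name `temperature_effect` are illustrative choices, not part of the original study.

```python
# Mood ratings (10-point scale) copied from Table 5.1.
mood = {
    "well_lit": {"cool": 8, "warm": 5},
    "flickering": {"cool": 5, "warm": 2},
}

def temperature_effect(ratings):
    """Cool-room rating minus warm-room rating for one variation."""
    return ratings["cool"] - ratings["warm"]

# Effect of temperature within each lighting variation.
effects = {name: temperature_effect(r) for name, r in mood.items()}

# Effect of lighting within each temperature condition.
lighting_shift = {
    room: mood["well_lit"][room] - mood["flickering"][room]
    for room in ("cool", "warm")
}

print(effects)         # {'well_lit': 3, 'flickering': 3}
print(lighting_shift)  # {'cool': 3, 'warm': 3}
```

The cool-versus-warm difference is 3 points in both variations, and the flickering light lowers both rooms by the same 3 points. This is what it means for an extraneous variable to affect everyone equally: it shifts the ratings but leaves the comparison between conditions intact.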

Assigning People to Conditions

The third key feature of experimental designs is that the researcher can assign people to receive different conditions, or versions, of the independent variable. This is an important piece of the experimental process: Experimenters not only control the options—warm versus cool room, $100 versus no money, etc.—but they also control which participants get each option. Whereas a correlational design might assess the relationship between current mood and choosing the warm room, an experimental design will assign some participants to the warm room and then measure the effects on their mood. In other words, experimenters are able to make causal statements because they cause things to happen to a particular group of people.

The most common, and most preferable, way to assign people to conditions is through a process called random assignment. An experimenter who uses random assignment makes a separate decision for each participant as to which group he or she will be assigned to before the participant arrives. As the term implies, this decision is made randomly—by flipping a coin, using a random number table (for an example, see http://stattrek.com/tables/random.aspx), drawing numbers out of an envelope, or even simply alternating back and forth between experimental conditions. The overall goal is to try to balance preexisting differences among people, as Figure 5.2 illustrates. So, for example, some people might generally be more comfortable in warm rooms, while others might be more comfortable in cold rooms. If each person who shows up for the study has an equal chance of being in either group, then the groups in the sample should reflect the same distribution of differences as the population.

Figure 5.2: Random assignment

The 24 participants in our sample consist of a mix of happy and sad people. The goal of random assignment is to have these differences distributed equally across the experimental conditions. Thus, the two groups on the right each consist of six happy and six sad people, and our random assignment was successful.
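As a rough illustration of the coin-flip version of random assignment, the sketch below gives each participant an independent, equal chance of landing in either condition. The participant labels, the condition names, and the `seed` argument are assumptions made for the example; the happy/sad split mirrors the 24-person sample described for Figure 5.2.

```python
import random

def randomly_assign(participants, conditions=("warm", "cool"), seed=None):
    """Give each participant an independent, equal chance of each condition."""
    rng = random.Random(seed)  # the seed only makes the illustration repeatable
    return {person: rng.choice(conditions) for person in participants}

# Hypothetical sample: 12 happy and 12 sad participants, as in Figure 5.2.
sample = {f"P{i + 1:02d}": ("happy" if i < 12 else "sad") for i in range(24)}

assignment = randomly_assign(sample, seed=1)

# Tally group size and the happy/sad mix in each condition.
for condition in ("warm", "cool"):
    moods = [sample[p] for p, c in assignment.items() if c == condition]
    print(condition, len(moods), moods.count("happy"), moods.count("sad"))
```

With only 24 participants, a single run will rarely produce the perfect six-and-six split shown in the figure, and the two groups may not even be the same size. Random assignment balances preexisting differences on average, which is one reason small samples sometimes call for the matched approach described later in this section.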

Forming groups through random assignment also has the significant advantage of helping to avoid bias in the selection and assignment of subjects. For example, it would be a bad idea to assign people to groups based on a first impression of them because participants might be placed in the cold room if they arrived at the laboratory dressed in warm clothing. Experimenters who make decisions about condition assignments ahead of time can be more confident that the independent variable is responsible for changes in the dependent variable.

Worth highlighting here is the difference between random selection and random assignment. Random selection means that the sample of participants is chosen at random from the population, as with the probability sampling methods discussed in Chapter 4. However, most psychology experiments use a convenience sample of individuals who volunteer to complete the study. This means that the sample is often far from fully random. Even so, a researcher can still make sure that the study involves random assignment to groups, so that each condition contains an equal representation of the sample.

In some cases—most notably, when samples are small—random assignment may not be sufficient to balance an important characteristic that might affect the results of a particular study. Imagine conducting a study that compared two strategies for teaching students complex math skills. In this example, it would be especially important to make sure that both groups contained a mix of individuals with, say, average and above-average intelligence. For this reason, the experimenter would need to take extra steps to ensure that intelligence was equally distributed between the groups, which can be accomplished with a variation on random assignment called matched random assignment. This kind of assignment requires the experimenter to obtain scores on an important matching variable—in this case, intelligence—rank participants based on the matching variable, and then randomly assign people to conditions. Figure 5.3 shows how this process would unfold in our math-skills study. First, the researcher gives participants an IQ test to measure preexisting differences in intelligence. Second, the experimenter ranks participants based on these scores, from highest to lowest. Third, the experimenter moves down this list in order and randomly assigns each participant to one of the conditions. This process still contains an element of random assignment, but adding the extra step of rank ordering ensures a more balanced distribution of intelligence test scores across the conditions.

Figure 5.3: Matched random assignment

The 20 participants in our sample represent a mix of very high, average, and very low intelligence test scores (measured 1–100). The goal of matched random assignment is to ensure that this variation is distributed equally across the two conditions. The experimenter would first rank participants by intelligence test scores (top box), and then distribute these participants alternately between the conditions. The end result is that both groups (lower boxes) contain a good mix of high, average, and low scores.
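The three steps of matched random assignment (measure, rank, assign) can be sketched in a few lines. This is one common way to carry out the final step, assumed here for illustration: participants are taken in blocks of adjacent ranks, and the conditions are shuffled within each block. The scores, participant labels, condition names, and the `seed` argument are all hypothetical.

```python
import random
from statistics import mean

def matched_random_assign(scores, conditions=("strategy_1", "strategy_2"), seed=None):
    """Rank participants on the matching variable, then randomize within
    each block of adjacent ranks so every block covers all conditions."""
    rng = random.Random(seed)  # the seed only makes the illustration repeatable
    ranked = sorted(scores, key=scores.get, reverse=True)  # highest score first
    assignment = {}
    for start in range(0, len(ranked), len(conditions)):
        block = ranked[start:start + len(conditions)]
        order = list(conditions)
        rng.shuffle(order)  # the random step: who in the block gets which condition
        for person, condition in zip(block, order):
            assignment[person] = condition
    return assignment

# Hypothetical intelligence test scores (1-100) for 20 participants.
raw = [98, 95, 91, 88, 84, 79, 75, 70, 66, 61, 57, 52, 48, 43, 39, 34, 28, 21, 15, 9]
scores = {f"P{i + 1:02d}": s for i, s in enumerate(raw)}

assignment = matched_random_assign(scores, seed=7)

# Each condition should end up with 10 people and a similar mean score.
for condition in ("strategy_1", "strategy_2"):
    group = [scores[p] for p, c in assignment.items() if c == condition]
    print(condition, len(group), round(mean(group), 1))
```

Because each pair of adjacent ranks is split across the two conditions, the groups are guaranteed to be equal in size, and their mean scores can differ only by the small gaps within pairs. Chance still decides which member of each pair receives which teaching strategy.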

5.3 Experimental Validity

Chapter 2 discussed the concept of validity, or the degree to which the measures used in a study capture the constructs that they were designed to capture. That is, a measure of happiness needs to capture differences in people’s levels of happiness. This section returns to the subject of validity in an experimental context, assessing whether experimental results demonstrate the causal relationships that researchers think they are demonstrating. We will discuss two types of validity that are relevant to experimental designs. The first is internal validity, which assesses the degree to which results can actually be attributed to the independent variables. The second is external validity, which assesses how well the results generalize to situations beyond the specific conditions laid out in the experiment. Taken together, internal and external validity provide a way to assess the merits of an experiment. However, each kind has its own threats and remedies, as the following sections explain.

Internal Validity

To have a high degree of internal validity, experimenters strive for maximum control over extraneous variables. That is, they try to design experiments so that the independent variable is the only cause of differences between groups. But, of course, no study is ever perfect, and some degree of error is always in place. In many cases, errors are the result of unavoidable random causes, such as the health or mood of the participants on the day of the experiment. In other cases, errors are due to factors that are, in fact, within the experimenter’s control. This section focuses on several of these more manageable threats to internal validity and discusses strategies for reducing their influence.

Experimental Confounds

To avoid threats to the internal validity of an experiment, it is important to control and minimize the influence of extraneous variables that might add noise to a hypothesis test. In many cases, extraneous variables can be considered relatively minor nuisances, as when the mood experiment was inadvertently run in a depressing room. Now, though, suppose we conduct our study on temperature and mood, and due to a lack of careful planning, accidentally place all of the “warm room” participants in a sunny room, and the “cool room” participants in a windowless room. We might very well find that the warm-room participants are in a much better mood. Still, is this the result of warm temperatures or the result of exposure to sunshine? Unfortunately, we would be unable to tell the difference because of a confounding variable (or confound)—a variable that changes systematically with the independent variable. In this example, room lighting is confounded with room temperature because all of the warm-room participants are also exposed to sunshine, and all of the cool-room participants are not. This confounding combination of variables leaves us unable to determine which variable actually has the effect on mood. In other words, because our groups differ in more than one way, we cannot clearly say that the independent variable of interest (the room) caused the dependent variable (mood) to change.

This observation may seem oversimplified, but the way to avoid confounds is to be very careful in designing experiments. By ensuring that groups are alike in every way but the experimental condition, an experimenter can generally prevent confounds. Nevertheless, avoiding confounds is somewhat easier said than done because they can come from unexpected places. For example, most studies involve the use of multiple research assistants who manage data collection and interact with participants. Some of these assistants might be more or less friendly than others, so it is important to make sure each of them interacts with participants in all conditions. If the friendliest assistant always ran participants in the warm-room group, for example, assistant friendliness would be confounded with room temperature. Consequently, the experimenter would be unable to separate the influence of the independent variable (the room) from that of the confound (the research assistant).
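One way to make sure each assistant interacts with participants in all conditions is to plan the sessions in advance. The sketch below is a minimal illustration under assumed names: the assistants, the session count, and the function `balanced_schedule` are hypothetical, and a real study might also randomize the order of sessions.

```python
from collections import Counter
from itertools import cycle, product

def balanced_schedule(n_sessions, assistants, conditions):
    """Cycle through every assistant-condition pairing so that no assistant
    is tied to a single condition."""
    pairings = cycle(product(assistants, conditions))
    return [next(pairings) for _ in range(n_sessions)]

assistants = ["Assistant 1", "Assistant 2", "Assistant 3"]  # hypothetical staff
conditions = ["warm", "cool"]

schedule = balanced_schedule(12, assistants, conditions)
counts = Counter(schedule)

# Every assistant runs the same number of sessions in each room.
for (assistant, condition), n in sorted(counts.items()):
    print(assistant, condition, n)
```

With 12 sessions and six possible pairings, each assistant runs exactly two warm-room and two cool-room sessions. Any effect of an assistant's friendliness is then spread evenly across both conditions instead of varying systematically with the independent variable.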

Friendliness of the research assistant is a variable that can affect the outcome of an experiment.

Selection Bias

Internal validity can also be threatened when groups differ before the manipulation, a condition known as selection bias. Selection bias causes problems because these preexisting differences might be the driving factor behind the results. Imagine someone is investigating a new program that will help people stop smoking. The experimenter might decide to ask for volunteers who are ready to quit smoking and put them through a six-week program. But by asking for volunteers—a remarkably common error—the researcher gathers a group of people who are already somewhat motivated to stop smoking. Thus, it is difficult to separate the effects of the new program from the effects of this preexisting motivation.

One easy way to avoid this problem is through either random or matched random assignment. In the stop-smoking example, a researcher could still ask for volunteers, but then randomly assign these volunteers either to the new program or to a control group. Because both groups would then consist of people motivated to quit smoking, the effects of motivation would cancel out. Another way to minimize selection bias is to use the same people in both conditions so that they serve as their own control. This technique is known as a within-subject design, and we will discuss its advantages and disadvantages in the section “Within-Subject Designs.” In the stop-smoking example, the experimenter could assign volunteers first to one program and then to the other. However, this approach might present a problem: Participants who successfully quit smoking in the first program would not benefit from the second program.

Differential Attrition

Despite researchers’ best efforts at random assignment, they could still have a biased sample at the end of a study as a result of differential attrition. The problem of differential attrition occurs when subjects drop out of experimental groups for different reasons. Say we are conducting a study of the effects of exercise on depression levels. We manage to randomly assign people either to one week of regular exercise or to one week of regular therapy. At first glance, it appears that the exercise group shows a dramatic drop in depression symptoms. But then we notice that about one-third of the people in this group dropped out before completing the study. Chances are we are left with the participants who are most motivated to exercise, to overcome their depression, or both. Thus, it is difficult to isolate the effects of the independent variable on depression symptoms. Although we cannot prevent people from dropping out of our study, we can look carefully at those who do. In many cases, researchers can spot a pattern and use it to guide future research. For example, it may be possible to create a profile of people who dropped out of the exercise study and use this knowledge to increase retention for the next attempt.

Outside Events

As much as experimenters strive to control the laboratory environment, participants are often influenced by events in the outside world. These events—sometimes called history effects—are often large scale and include political upheavals and natural disasters. History effects threaten research because they make it difficult to tell whether participants’ responses are due to the independent variable or to the historical event(s). A paper published by social psychologist Ryan Brown, now a professor at the University of Oklahoma, offers a remarkable example. Brown et al.’s paper discussed the effects of receiving different types of affirmative action as people were selected for a leadership position. The goal was to determine the best way to frame affirmative action to avoid undermining the recipient’s confidence (Brown, Charnsangavej, Keough, Newman, & Rentfrow, 2000). For about a week during the data-collection process, students at the University of Texas where the study was being conducted protested on the school’s main lawn about a controversial lawsuit regarding affirmative-action policies. One side effect of these protests was that participants arriving for Brown’s study had to pass through a swarm of people holding signs that either denounced or supported affirmative action. These types of outside events are difficult, if not impossible, to control. But, because these researchers were aware of the protests, they made a decision to exclude data gathered from participants during the week of the protests from the study, thus minimizing the effects of outside events.

Expectancy Effects

One final set of threats to internal validity results from the influence of expectancies on people’s behavior. This influence can cause trouble for experimental designs in three related ways. First, experimenter expectancies can cause researchers to see what they expect to see, leading to subtle bias in favor of their hypotheses. In a clever demonstration of this phenomenon, the psychologist Robert Rosenthal asked his graduate students at Harvard University to train groups of rats to run a maze (Rosenthal & Fode, 1963). He also told them that based on a pretest, the rats had been classified as either bright or dull. As might be surmised, these labels were pure fiction, but they still influenced the way that the students treated the rats. Those labeled bright were given more encouragement and learned the maze much more quickly than rats labeled dull. Rosenthal later extended this line of work to teachers’ expectations of their students (Rosenthal & Jacobson, 1992) and found support for the same conclusion: People often bring about the results they expect by behaving in a particular way.

One common way to avoid experimenter expectancies is to have participants interact with a researcher who is “blind” to (i.e., unaware of) the condition to which each participant has been assigned. Blind researchers may be fully aware of the general research hypothesis, but their behavior is less likely to affect the results if they are unaware of the specific conditions. In the Rosenthal and Fode (1963) study, the graduate students’ behavior only influenced the rats’ learning speed because they were aware of the labels bright and dull. If these labels had not been assigned, the rats would have been treated fairly equally across the conditions.

Second, participants in a research study often behave differently based on their own expectancies about the goals of the study. These expectancies often develop in response to demand characteristics, or cues in the study that lead participants to guess the hypothesis. In a well-known study conducted at the University of Wisconsin, psychologists Leonard Berkowitz and Anthony LePage (1967) found that participants would behave more aggressively—by delivering electric shocks to another participant—if a gun was in the room than if there were no gun present. This finding has some clear implications for gun-control policies, suggesting that the mere presence of guns increases the likelihood of gun violence. However, a common critique of this study contends that participants may have quickly clued in to its purpose and figured out how they were “supposed” to behave. That is, the gun served as a demand characteristic, possibly making participants act more aggressively because they thought the researchers expected them to do so.

To minimize demand characteristics, researchers use a variety of techniques, all of which attempt to hide the true purpose of the study from participants. One common strategy is to use a cover story, or a misleading statement about what is being studied. Chapter 1 (1.3) discussed Milgram’s famous obedience studies, which discovered that people were willing to obey orders to deliver dangerous levels of electric shocks to other people. To disguise the purpose of the study, Milgram described it to participants as a study of punishment and learning. To give another example, Ryan Brown and colleagues (2000) presented their affirmative-action study as a study of leadership styles. These cover stories aimed to give participants a compelling explanation for what they experienced during the study and to direct their attention away from the research hypothesis.

Another strategy for avoiding demand characteristics is to use the unrelated-experiments technique, which leads participants to believe that they are completing two different experiments during one laboratory session. The experimenter can use this bit of deception to present the independent variable during the first experiment and then measure the dependent variable during the second experiment. For example, a study by Harvard psychologist Margaret Shih and colleagues (Shih, Pittinsky, & Ambady, 1999) recruited Asian-American females and asked them to complete two supposedly unrelated studies. In the first, they were asked to read and form impressions of one of two magazine articles; these articles were designed to make them focus on either their Asian-American identity or their female identity. In the second experiment, they were asked to complete a math test as quickly as possible. The goal of this study was to examine the effects of priming different aspects of identity on math performance. Based on previous research, these authors predicted that priming an Asian-American identity would remind participants of positive stereotypes regarding Asians and math performance, whereas priming a female identity would remind participants of negative stereotypes regarding women and math performance. As researchers expected, priming an Asian-American identity led this group of participants to do better on a math test than did priming a female identity. The unrelated-experiments technique was especially useful for this study because it kept participants from connecting the independent variable (magazine article prime) with the dependent variable (math test).

The placebo effect can test whether alcohol affects behavior, or whether people just expect it to and exhibit changed behavior based on their expectations.

A final way in which expectancies shape behavior is the placebo effect, meaning that change can result from the mere expectation that change will occur. Imagine we want to test the hypothesis that alcohol causes people to become aggressive. One relatively easy way to do this would be to give alcohol to a group of volunteers (aged 21 and older) and then measure how aggressively they behave in response to being provoked. The problem with this approach is that people also expect alcohol to change their behavior, and so we might see changes in aggression simply because of these expectations. Fortunately, the problem has an easy solution: add a placebo control group to the study that mimics the experimental condition in every way but one. In this case, we might tell all participants that they will be drinking a mix of vodka and orange juice but only add vodka to half of the participants’ drinks. The orange-juice-only group serves as our placebo control. Any differences between this group and the alcohol group can be attributed to the alcohol itself.

External Validity

To have a high degree of external validity in experiments, researchers strive for maximum realism in the laboratory environment. External validity means that the results extend beyond the particular set of circumstances created in a single study. Recall that science is a cumulative discipline and that knowledge grows one study at a time. Thus, each study is more meaningful: 1) to the extent that it sheds light on a real phenomenon; and 2) to the extent that the results generalize to other studies. This section examines each of these criteria separately.

Mundane Realism

The first component of external validity is the extent to which an experiment captures the real-world phenomenon under study. Inspired by a string of school shootings in the 1990s, one popular question in the area of aggression research asks whether rejection by a peer group leads to aggression. That is, when people are rejected from a group, do they lash out and behave aggressively toward the members of that group? Researchers must find realistic ways to manipulate rejection and measure aggression without infringing on participants’ welfare. Given the need to strike this balance, how real can conditions be in the laboratory? How do we study real-world phenomena without sacrificing internal validity?

The answer is to strive for mundane realism, meaning that the research replicates the psychological conditions of the real-world phenomenon (sometimes referred to as ecological validity). In other words, we need not recreate the phenomenon down to the last detail; instead, we aim to make the laboratory setting feel like the real-world phenomenon. Researchers studying aggressive behavior and rejection have developed some rather clever ways of doing this, including allowing participants to administer loud noise blasts or serve large quantities of hot sauce to those who reject them. Psychologically, these acts feel like aggressive revenge because participants are able to lash out against those who rejected them—with the intent of causing harm—even though the behaviors themselves may differ from the ways people exact revenge in the real world.

In a 1996 study, Tara MacDonald and her colleagues at Queen’s University in Ontario, Canada, examined the relationship between alcohol and condom use (MacDonald, Zanna, & Fong, 1996). The authors were intrigued by a puzzling set of real-world data: Most people self-reported that they would use condoms when engaging in casual sex, but actual rates of unprotected sex (i.e., having sexual intercourse without a condom) were also remarkably high. In this study, the authors found that alcohol was a key factor in causing “common sense to go out the window” (p. 763), resulting in a decreased likelihood of condom use. But how on earth might they study this phenomenon in the laboratory? In the authors’ words, “even the most ambitious of scientists would have to conclude that it is impossible to observe the effects of intoxication on actual condom use in a controlled laboratory setting” (p. 765).

To solve this dilemma, MacDonald and colleagues developed a clever technique for studying people’s intentions to use condoms. Participants were randomly assigned to either an alcohol or placebo condition, and then they viewed a video depicting a young couple faced with the dilemma of whether to have unprotected sex. At the key decision point in the video, the tape was stopped and participants were asked what they would do in the situation. As predicted, participants who were randomly assigned to consume alcohol said they would be more willing to proceed with unprotected sex. While this laboratory study does not capture the full experience of making decisions about casual sex, it does a pretty nice job of capturing the psychological conditions involved.

Generalizing Results

The second component of external validity, generalizability, refers to the extent to which the results extend to other studies by using a wide variety of populations and a wide variety of operational definitions (sometimes referred to as population validity). If we conclude that rejection causes people to become more aggressive, for example, this conclusion should ideally carry over to other studies of the same phenomenon, studies that use different ways of manipulating rejection and different ways of measuring aggression. If we want to conclude that alcohol reduces the intention to use condoms, we would need to test this relationship in a variety of settings—from laboratories to nightclubs—using different measures of intentions.

Thus, each single study researchers conduct is limited in its conclusions. For a particular idea to take hold in the scientific literature, it must be replicated, or repeated in different contexts. Replication can take one of four forms. First, exact replication involves trying to recreate the original experiment as closely as possible to verify the findings. This type of replication is often the first step following a surprising result, and it helps researchers to gain more confidence in the patterns.

The second and much more common method, conceptual replication, involves testing the relationship between conceptual variables using new operational definitions. Conceptual replications would include testing aggression hypotheses using new measures or examining the link between alcohol and condom use in different settings. For example, rejection might be operationalized in one study by having participants be chosen last for a group project. A conceptual replication might take a different approach, operationalizing rejection by having participants be ignored during a group conversation or voted out of the group. Likewise, a conceptual replication might change the operationalization of aggression, with one study measuring the delivery of loud blasts of noise and another measuring the amount of hot sauce that people give to their rejecters. Each variation studies the same concept (aggression or rejection) but uses slightly different operationalizations. If all of these variations yield similar results, this further supports the underlying ideas—in this case, that rejection causes people to be more aggressive.

The third method, participant replication, involves repeating the study with a new population of participants. These types of replication are usually driven by a compelling theory as to why the two populations differ. For example, we might reasonably hypothesize that the decision to use condoms is guided by a different set of considerations among college students than among older, single adults. Or, we might hypothesize that different cultures around the world might have different responses to being rejected from a group.

Finally, constructive replication re-creates the original experiment but adds elements to the design. These additions are typically designed to either rule out alternative explanations or extend knowledge about the variables under study. Our rejection and aggression example might compare the impact of being rejected by a group versus by an individual.

Internal Versus External Validity

This chapter has focused on two ways to assess validity in the context of experimental designs. Internal validity assesses the degree to which results can be attributed to independent variables; external validity assesses how well results generalize beyond the specific conditions of the experiment. In an ideal world, studies would have a high degree of both of these. That is, we would feel completely confident that our independent variable was the only cause of differences in our dependent variable, and our experimental paradigm would perfectly capture the real-world phenomenon under study.

Reality, though, often demands a trade-off between internal and external validity. In MacDonald et al.’s (1996) study on condom use, the researchers sacrificed some realism in order to conduct a tightly controlled study of participants’ intentions. In Berkowitz and LePage’s (1967) study on the effect of weapons, the researchers risked the presence of a demand characteristic in order to study reactions to actual weapons. These types of trade-offs are always made based on the goals of the experiment.

Research: Applying Concepts

Balancing Internal Versus External Validity

To give you a better sense of how researchers make the compromises involving internal and external validity, consider the following fictional scenarios.

Scenario 1—Time Pressure and Stereotyping

Dr. Bob is interested in whether people are more likely to rely on stereotypes when they are in a hurry. In a well-controlled laboratory experiment, he asks participants to categorize ambiguous shapes as either squares or circles, and half of these participants are given a short time limit to accomplish the task. The independent variable is the presence or absence of time pressure, and the dependent variable is the extent to which people use stereotypes in their classification of ambiguous shapes. Dr. Bob hypothesizes that people will be more likely to use stereotypes when they are in a hurry because they will have fewer cognitive resources to consider carefully all aspects of the situation. Dr. Bob takes great care to have all participants meet in the same room. He uses the same research assistant every time, and the study is always conducted in the morning. Consistent with his hypothesis, Dr. Bob finds that people seem to use shape stereotypes more under time pressure.

The internal validity of this study appears high—Dr. Bob has controlled for other influences on participants’ attention span by collecting all of his data in the morning. He has also minimized error variance by using the same room and the same research assistant. In addition, Dr. Bob has created a tightly controlled study of stereotyping through the use of circles and squares. Had he used photographs of people (rather than shapes), the attractiveness of these people might have influenced participants’ judgments. The study, however, has a trade-off: By studying the social phenomenon of stereotyping using geometric shapes, Dr. Bob has removed the social element of the study, thereby posing a threat to mundane realism. The psychological meaning of stereotyping shapes is rather different from the meaning of stereotyping people, which makes this study relatively low in external validity.

Scenario 2: Hunger and Mood

Dr. Jen is interested in the effects of hunger on mood; not surprisingly, she predicts that people will be happier when they are well fed. She tests this hypothesis with a lengthy laboratory experiment, requiring participants to be confined to a laboratory room for 12 hours with very few distractions. Participants have access to a small pile of magazines to help pass the time. Half of the participants are allowed to eat during this time, and the other half is deprived of food for the full 12 hours. Dr. Jen, a naturally friendly person, collects data from the food-deprivation group on a Saturday afternoon, while her grumpy research assistant, Mike, collects data from the well-fed group on a Monday morning. Her independent variable is food deprivation, with participants either not deprived of food or deprived for 12 hours. Her dependent variable consists of participants’ self-reported mood ratings. When Dr. Jen analyzes the data, she is shocked to discover that participants in the food-deprivation group are much happier than those in the well-fed group.

Compared to our first scenario, this study seems high on external validity. To test her predictions about food deprivation, Dr. Jen actually deprives her participants of food. One possible problem with external validity is that participants are confined to a laboratory setting during the deprivation period with only a small pile of magazines to read. That is, participants may be more affected by hunger when they do not have other things to distract them. In the real world, people are often hungry but distracted by paying attention to work, family, or leisure activities. Dr. Jen, though, has sacrificed some external validity for the sake of controlling how participants spend their time during the deprivation period. The larger problem with her study has to do with internal validity. Dr. Jen has accidentally confounded two additional variables with her independent variable: Participants in the deprivation group have a different experimenter, and their data are collected on a different day and at a different time of day. Thus, Dr. Jen’s surprising results most likely reflect that everyone is in a better mood on Saturday than on Monday and that Dr. Jen is more pleasant to spend 12 hours with than Mike is.

Scenario 3: Math Tutoring and Graduation Rates

Dr. Liz is interested in whether specialized math tutoring can help increase graduation rates among female math majors. To test her hypothesis, she solicits female volunteers for a math-skills workshop by placing fliers around campus, as well as by sending email announcements to all math majors. The independent variable is whether participants are in the math-skills workshop, and the dependent variable is whether participants graduate with a math degree. Those who volunteer for the workshop are given weekly skills tutoring, along with informal discussion groups designed to provide encouragement and increase motivation. At the end of the study, Dr. Liz is pleased to see that participants in the workshops are twice as likely as nonparticipants to stick with the major and graduate.

The obvious strength of this study is its external validity. Dr. Liz has provided math tutoring to math majors, and she has observed a difference in graduation rates. Thus, this study is very much embedded in the real world. However, this external validity comes at a cost to internal validity. The study’s biggest flaw is that Dr. Liz has recruited volunteers for her workshops, resulting in selection bias for her sample. People who volunteer for extra math tutoring are likely to be more invested in completing their degree and might also have more time available to dedicate to their education. Dr. Liz would also need to be mindful of how many people drop out of her study. If significant numbers of participants withdraw, she could have a problem with differential attrition, so that the most motivated people stayed with the workshops. Dr. Liz can fix this study with relative ease by asking for volunteers more generally and then randomly assigning these volunteers to take part in either the math tutoring workshops or a different type of workshop. While the sample might still be less than random, Dr. Liz would at least have the power to assign participants to different groups.
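The random-assignment fix suggested for Scenario 3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the volunteer IDs, the condition labels, and the function name `randomly_assign` are hypothetical and do not come from any real study.

```python
import random

def randomly_assign(volunteers, conditions, seed=None):
    """Shuffle the volunteers, then deal them into conditions in turn.

    Shuffling first means that chance alone, not the volunteers' own
    motivation or schedule, decides who ends up in which condition.
    """
    rng = random.Random(seed)          # seed only makes the example repeatable
    shuffled = list(volunteers)
    rng.shuffle(shuffled)
    groups = {condition: [] for condition in conditions}
    for index, person in enumerate(shuffled):
        groups[conditions[index % len(conditions)]].append(person)
    return groups

# Hypothetical pool of 20 volunteers and two workshop types.
volunteers = [f"volunteer_{i}" for i in range(1, 21)]
groups = randomly_assign(volunteers, ["math tutoring", "other workshop"], seed=1)

for condition, members in groups.items():
    print(condition, len(members))
```

Because every volunteer has the same chance of landing in either workshop, preexisting differences such as motivation should be spread roughly evenly across the two groups rather than concentrated in the tutoring group.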

5.4 Experimental Design

The process of designing experiments boils down to deciding what to manipulate and how to do it. This section covers two broad issues related to experimental design: deciding how to structure the levels, or different versions of an independent variable, and deciding on the number of independent variables necessary to test the hypotheses. While these decisions may seem tedious, they are at the crux of designing successful experiments, and are, therefore, the key to performing successful tests of hypotheses.

Levels of the Independent Variable

The primary goal in designing experiments is to ensure that the levels of independent variables are equivalent in every way but one. This is what allows researchers to make causal statements about the effects of that single change. These levels can be formed in one of two broad ways: representing two distinct groups of people or representing the same group of people over time.

Between-Subject Designs

In most of the examples discussed so far, the levels of independent variables have represented two distinct groups: participants are in either the control group or the experimental group. This type of design is referred to as a between-subject design because the levels differ between one subject and the next. Each participant who enrolls in the experiment is exposed to only one level of the independent variable (for example, either the experimental or the control group). Most of the examples so far have been illustrations of between-subject designs: participants receive either alcohol or a placebo; students read an article designed to prime either their Asian or their female identity; and graduate students train rats that are falsely labeled either bright or dull. The “either-or” between-subject approach is common and has the advantage of using distinct groups to represent each level of the independent variable. In other words, participants who are asked to consume alcohol are completely distinct from those asked to consume the placebo drink. However, the between-subject approach is only one option for structuring the levels of the independent variable. This section examines two additional ways to structure these levels.

Within-Subject Designs

In some cases, the levels of the independent variable can represent the same participants at different time periods. This type of design is referred to as a within-subject design because the levels differ within individual participants. Each participant who enrolls in the experiment would be exposed to all levels of the independent variable. That is, every participant would be in both the experimental and the control group. Within-subject designs are often used to compare changes over time in response to various stimuli. For example, a researcher might measure anxiety symptoms before and after people are locked in a room with a spider, or measure depression symptoms before and after people undergo drug treatment.

Within-subject designs have two main advantages over between-subject designs. First, because the same people experience every level of the IV, these designs require fewer participants. Suppose we decide to collect data from 20 participants at each level of an IV. In a between-subject design with three levels, we would need 60 people. However, if we run the same experiment as a within-subject design, exposing the same group of people to three different sets of circumstances, we would need only 20 people. Thus, within-subject designs are often a good way to conserve resources.

Second, participants also serve as their own control group, allowing the researcher to minimize a major source of error variance. Remember that one key feature of experimental design is the researcher’s power to assign people to groups to distribute subject differences randomly across the levels of the IV. Using a within-subject design solves the problem of subject differences in another way, by examining changes within people. For instance, in the study of spiders and anxiety, some participants are likely to have higher baseline anxiety than others. By measuring changes in anxiety in the same group of people before and after spider exposure, we are able to minimize the effects of individual differences.

Carryover effects can be understood through the example of monitoring people’s reactions to different film clips. How they feel about one image may influence how they react to the next image.

Disadvantages of Within-Subject Designs

Within-subject designs also have two clear disadvantages compared to between-subject designs. First, they pose the risk of carryover effects, in which the effects of one level are still present when another level is introduced. Because the same people are exposed to all levels of the IV, it can be difficult to separate the effects of one level from the effects of the others. One common paradigm in emotion research is to show participants several film clips that elicit different types of emotion. People might view one clip showing a puppy playing with a blanket, another showing a child crying, and another showing a surgical amputation. Even without seeing these clips in full color, we can imagine that it would be hard to shake off the disgust triggered by the amputation to experience the joy triggered by the puppy.

When researchers use a within-subject design, they take steps to minimize carryover effects. In studies of emotion, for example, researchers typically show a brief neutral clip—like waves rolling onto a beach—after each emotional clip, so that participants experience each emotion after viewing a benign image. Another simple technique is to collect data from the baseline control condition first whenever possible. In the study of spiders and anxiety, it would be important to measure baseline anxiety at the start of the experiment before exposing people to spiders. Once people have been surprised by a spider, it will be hard to get them to relax enough to collect control ratings of anxiety.

Second, within-subject designs risk order effects, meaning that the order in which levels are presented can moderate their effects. Order effects fall into two categories. The practice effect happens when participants’ performance improves over time simply due to repeated attempts. This is a particular problem in studies that examine learning. Say we use a within-subject design to compare two techniques for teaching people to solve logic problems. Participants would learn technique A, then take a logic test, then learn technique B, and then take a second logic test. The possible problem is that participants will have had more opportunities to practice logic problems by the time they take the second test. This makes it difficult to separate the effects of practicing the logic problems from the effects of using different teaching techniques.

The flipside of practice effects is the phenomenon of the fatigue effect, which happens when participants’ performance decreases over time due to repeated testing. Imagine running a variation of the above experiment, teaching people different ways to improve their reaction time. Participants might learn each technique and have their reaction time tested several times after each one. The problem is that people gradually start to tire, and their reaction times slow down due to fatigue. Thus, it would be difficult to separate the effects of fatigue from the effects of the different teaching techniques.

Both types of order effects confound the order of presentation with the level of the independent variable. Fortunately, researchers have a relatively easy way to deal with both practice and fatigue effects: a process called counterbalancing. Counterbalancing involves varying the order of presentation across groups of participants. The simplest approach is to divide participants into as many groups as there are possible orders of the levels in the experiment. That is, we create a group for each possible order, allowing us to identify the effects of encountering the conditions in different orders. In the examples above, the learning experiments involved two techniques, A and B. To counterbalance these techniques across the study, we divide the participants into two groups. We expose one group to A and then B; we expose the other group to B and then A. When it is time to analyze the data, we will be able to examine the effects of both presentation order and teaching technique. If the order of presentation made a difference, then the A/B group would differ from the B/A group in some way.
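The counterbalancing scheme just described can be illustrated with a short Python sketch. The participant labels and the function name `counterbalance` are hypothetical, and the sketch assumes full counterbalancing, meaning one group for every possible order of the levels.

```python
from itertools import permutations

def counterbalance(participants, levels):
    """Assign each participant one presentation order, cycling through
    every possible order so the orders are used equally often."""
    orders = list(permutations(levels))   # for A and B: (A, B) and (B, A)
    assignment = {}
    for index, person in enumerate(participants):
        assignment[person] = orders[index % len(orders)]
    return assignment

# Hypothetical sample of eight participants and two teaching techniques.
participants = [f"P{i}" for i in range(1, 9)]
assignment = counterbalance(participants, ["A", "B"])

for person, order in assignment.items():
    print(person, "->", " then ".join(order))
```

With two levels there are only two orders, but the number of orders grows quickly: three levels produce six possible orders, and four levels produce 24. In practice, a researcher would probably also shuffle the participant list before assigning orders, so that who receives which order is itself left to chance.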

Mixed Designs

The third common way to structure the levels of an IV is using a mixed design, which contains at least one between-subject variable and at least one within-subject variable. So, in the previous example, participants would be exposed to both teaching techniques (A and B) but in only one order of presentation. In this case, teaching technique is a within-subject variable because participants experience both levels, and presentation order is a between-subject variable because participants experience only one level. Because we have one of each in the overall experiment, it is a mixed design.

Studies that compare the effects of different drugs commonly use mixed designs. Imagine we want to compare two new drugs, Drug X and Drug Y, against a placebo control to determine which has the strongest effects on reducing depression symptoms. To perform this study, we would want to measure depression symptoms on at least three occasions: before starting drug treatment, after a few months of taking the drug, and then again after a few months of stopping the drug (to assess relapse rates). So, our participants would be given one of three possible treatments and then measured at each of three time periods. In this mixed design, measurement time is a within-subject variable because participants are measured at all possible times, while the drug is a between-subject variable because participants experience only one of the three possible treatments.

Figure 5.4 shows the hypothetical results of this study. Observe that the placebo pill has no effect on depression symptoms; depression scores in this group are the same at all three measurements. Drug X appears to cause significant improvement in depression symptoms; depression scores drop steadily across measurements in this group. Strangely, Drug Y seems to make depression worse; depression scores increase steadily across measurements in this group. The mixed design allows us both to track people over time and to compare different drugs in one study.

Figure 5.4: Example of a mixed-subjects design

Research: Thinking Critically

Outwalking Depression

Follow the link below to an article from Psychology Today, describing a 2011 research study from the Journal of Psychiatric Research. The study provides new evidence of the benefits of exercise for people with depression. As you read the article, consider what you have learned so far about the research process, and then respond to the questions below.

https://www.psychologytoday.com/blog/exercise-and-mood/201107/outwalking-depression

Think About It

1. Identify the following essential aspects of this experimental design:

a) What are the IV and DV in this study?

b) How many levels does the IV have?

c) Is this a between-subjects, within-subjects, or mixed design?

d) Draw a simple table labeling each condition.

2. a) What preexisting differences between groups should the researchers be sure to take into account? Name as many as you can.

b) How should the researchers assign participants to the conditions in order to ensure that preexisting differences cannot account for the results?

3. How might expectancy effects influence the results of this study? Can you think of any ways to control for this?

4. Briefly state how you would replicate this study in each of the following ways:

a) exact replication

b) conceptual replication

c) participant replication

d) constructive replication

One-Way Versus Factorial Designs

The second big issue in creating experimental designs is to decide how many independent variables to manipulate. In some cases, we can test our hypotheses by manipulating a single IV and measuring the outcome—such as giving people either alcohol or a placebo drink and measuring the intention to use condoms. In other cases, hypotheses involve more complex combinations of variables. Earlier, the chapter discussed research findings that people tend to act more aggressively after a peer group has rejected them—a single independent variable. Researchers could, however, extend this study and ask what happens when people are rejected by members of the same sex versus members of the opposite sex. We could go one step further and test whether the attractiveness of the rejecters matters, for a total of three independent variables. These examples illustrate two broad categories of experimental design, known as one-way and factorial designs.

One-Way Designs

If a study involves assigning people to either an experimental or control group and measuring outcomes, it has a one-way design, or a design that has only one independent variable with two or more levels to the variable. These tend to be the simplest experiments and have the advantage of testing manipulations in isolation. The majority of drug studies use one-way designs. These types of study compare the effects on medical outcomes for people randomly assigned, for instance, to take the antidepressant drug Prozac or a placebo. Note that a one-way design can still have multiple levels; in many cases it is preferable to test several different doses of a drug. So, for example, we might test the effects of Prozac by assigning people to take doses of 5 mg, 10 mg, 20 mg, or a placebo control. The independent variable would be the drug dose, and the dependent variable would be a change in depression symptoms. This one-way design would allow us to compare all three of the drug doses to a placebo control, as well as to test the effects of varying doses of the drug. Figure 5.5 shows hypothetical results from this study. We can see that even those receiving the placebo showed a drop in depression symptoms, with the 10-mg dose of Prozac producing the maximum benefit.

Figure 5.5: Comparing drug doses in a one-way design

Factorial Designs

Despite the appealing simplicity of one-way designs, experiments conducted in the field of psychology with only one IV are relatively rare. The real world is much more complicated, so studies that focus on people’s thoughts, feelings, and behaviors must somehow capture this complexity. Thus, the rejection-and-aggression example above is not that farfetched. If a researcher wanted to manipulate the occurrence of rejection, the gender of the rejecters, and the attractiveness of the rejecters in a single study, the experiment would have a factorial design. Factorial designs are those that have two or more independent variables, each of which has two or more levels. When experimenters use a factorial design, their purpose is to observe both the effects of individual variables and the combined effects of multiple variables.

Factorial designs have their own terminology to reflect the fact that they include both individual variables and combinations of variables. The beginning of this chapter explained that the versions of an independent variable are referred to as both levels and conditions, with a subtle difference between the two. This difference becomes relevant to the discussion of factorial designs. Specifically, levels refer to the versions of each IV, while conditions refer to the groups formed by combinations of IVs. Consider one variation of the rejection-and-aggression example from this perspective: The first IV has two levels because participants are either rejected or not rejected. The second IV also has two levels because members of the same sex or the opposite sex do the rejecting. To determine the number of conditions in this study, we calculate the number of different experiences that participants can have in the study. This is a simple matter of multiplying the levels of separate variables, so two multiplied by two, for a total of four conditions.

Researchers also have a way to quickly describe the number of variables in their design: A two-way design has two independent variables; a three-way design has three independent variables; an eight-way design has eight independent variables, and so on. Even more useful, the system of factorial notation offers a simple way to describe both the number of variables and the number of levels in experimental designs. For instance, we might describe our design as a 2 × 2 (pronounced “two by two”), which instantly communicates two things: (1) the study uses two independent variables, indicated by the presence of two separate numbers and (2) each IV has two levels, indicated by the number 2 listed for each one.
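The relationship between levels, conditions, and factorial notation can be made concrete with a small Python sketch. It uses the two-variable rejection-and-aggression example from the text; the dictionary layout and variable names are illustrative assumptions rather than standard notation.

```python
from itertools import product
from math import prod

# Each key is an independent variable; each list holds that variable's levels.
design = {
    "rejection": ["rejected", "not rejected"],
    "rejecter sex": ["same sex", "opposite sex"],
}

# Factorial notation lists the number of levels of each IV: "2 × 2".
notation = " × ".join(str(len(levels)) for levels in design.values())

# Conditions are all combinations of one level from each IV.
conditions = list(product(*design.values()))

# The number of conditions equals the product of the level counts.
n_conditions = prod(len(levels) for levels in design.values())

print(notation, "design with", n_conditions, "conditions")
for condition in conditions:
    print(condition)
```

Adding a third IV with two levels, such as the attractiveness of the rejecters, would change the notation to 2 × 2 × 2 and double the number of conditions to eight.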

The 2 × 2 Design

One of the most common factorial designs also happens to be the simplest one: the 2 × 2 design. As noted above, these designs have two independent variables, with two levels each, for a total of four experimental conditions. The simplicity of these designs makes them a useful way to become more comfortable with some of the basic concepts of experiments. This section will explore an example of a 2 × 2 design and analyze it in detail.

Figure 5.6: Sample 2 × 2 design: Results

Beginning in the late 1960s, social psychologists developed a keen interest in understanding the predictors of helping behavior. This interest was inspired, in large part, by the tragedy of Kitty Genovese, who was killed outside her apartment building while none of her neighbors called the police (Gansberg, 1964). As Chapter 2 (2.1) discussed, in one representative study, Princeton psychologists John Darley and Bibb Latané examined people’s likelihood of responding to a staged emergency. Participants were led to believe that they were taking part in a group discussion over an intercom system, but in reality, all of the other participants were prerecorded. The key independent variable was the number of other people supposedly present, ranging from two to six. A few minutes into the conversation, one participant appeared to have a seizure. The recording went like this (actual transcript; Darley & Latané, 1968):

I could really-er-use some help so if someone would-er-give me a little h-hel-puh-er-er-er c-could somebody er- er-hel-er-uh-uh-uh [choking sounds] . . . I’m gonna die-er-er-I’m . . . gonna die-er-hel-er-er-seizure-er [chokes, then quiet].

What do people do in this situation? Do they help? How long does it take? Darley and Latané discovered that two things happened as the group became larger: People were less likely to help at all, and those who did help took considerably longer to do so. Researchers concluded from this and other studies that people are less likely to help when other people are present because the responsibility for helping is “diffused” among the members of the crowd (Darley & Latané, 1968).

Building on this earlier conclusion, the sociologist Jane Piliavin and her colleagues (Piliavin, Piliavin, & Rodin, 1975) explored the influence of two additional variables on helping behavior. The experimenters staged an emergency on a New York City subway train in which a person who was in on the study appeared to collapse in pain. Piliavin and her team manipulated two variables in their staged emergency. The first independent variable was the presence or absence of a nearby medical intern, who could be easily identified in blue scrubs. The second independent variable was the presence or absence of a large disfiguring scar on the victim’s face. The combination of these variables resulted in four conditions, as Table 5.2 shows. The dependent variable in this study was the percentage of people taking action to help the confederate.

Table 5.2: 2 × 2 Design of the Piliavin et al. study

           No intern    Intern
No scar    1            2
Scar       3            4

The authors predicted that bystanders would be less likely to help if a perceived medical professional was nearby since he or she was considered more qualified to help the victim. They also predicted that people would be less likely to help when the confederate had a large scar because previous research had demonstrated convincingly that people avoid contact with those who are disfigured or have other stigmatizing conditions (e.g., Goffman, 1963). As Figure 5.6 reveals, the results supported these hypotheses. Both the presence of a scar and the presence of a perceived medical professional reduced the percentage of people who came to help. Nevertheless, something else is apparent in these results: When the confederate was not scarred, having an intern nearby led to a small decrease in helping (from 88% to 84%). However, when the confederate had a large facial scar, having an intern nearby decreased helping from 72% to 48%. In other words, it seems these variables are having a combined effect on helping behavior. The next section examines these combined effects more closely.

From Piliavin et al. (1975)

Main Effects and Interactions

When experiments involve only one independent variable, the analyses can be as simple as comparing two group means—as did the example in Chapter 1, which compared the happiness levels of couples with and without children. But what about cases where the design has more than one independent variable?

A factorial design has two types of effects: A main effect refers to the effect of each independent variable on the dependent variable, averaging values across the levels of other variables. A 2 × 2 design has two main effects; a 2 × 2 × 2 design has three main effects because there are three IVs. An interaction occurs when the variables have a combined effect; that is, the effects of one IV are different depending on the levels of the other IV. So, applying this new terminology to the Piliavin et al. (1975) “subway emergency” study produces three possible results (“possible,” because we would need to use statistical analyses to verify them):

1. The main effect of scar: Does the presence of a scar affect helping behavior? Yes. More people help in the absence of a facial scar. Figure 5.6 indicates that the bars on the left (no scar) are, on average, higher than those on the right (scar).

2. The main effect of intern: Does the presence of an intern affect helping behavior? Yes. More people help when no medical intern is on hand. Note that in Figure 5.6, the red bars (no intern) are, on average, higher than the tan bars (intern).

3. The interaction between scar and intern: Does the effect of one variable depend on the level of the other variable? Yes. Refer to Figure 5.6 and observe that the presence of a medical intern matters more when the victim has a facial scar. In visual terms, the gap between red and tan bars is much larger in the bars on the right. This indicates an interaction between scar and intern.
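The three results above can be checked with simple arithmetic on the four helping percentages reported in the text (88%, 84%, 72%, and 48%). The Python sketch below is descriptive only: it assumes the four conditions can be averaged with equal weight, its variable and function names are hypothetical, and it does not replace the statistical analyses needed to verify the effects.

```python
# Percentage of bystanders helping in each condition, as reported in the text.
helping = {
    ("no scar", "no intern"): 88,
    ("no scar", "intern"): 84,
    ("scar", "no intern"): 72,
    ("scar", "intern"): 48,
}

def marginal_mean(cells, position, level):
    """Average the cells that share one level of one IV.

    position 0 selects the scar variable; position 1 selects the intern variable.
    """
    values = [value for key, value in cells.items() if key[position] == level]
    return sum(values) / len(values)

# Main effects: difference between the marginal means of each IV.
scar_effect = marginal_mean(helping, 0, "scar") - marginal_mean(helping, 0, "no scar")
intern_effect = marginal_mean(helping, 1, "intern") - marginal_mean(helping, 1, "no intern")

# Interaction: does the effect of the intern differ across levels of scar?
intern_effect_no_scar = helping[("no scar", "intern")] - helping[("no scar", "no intern")]
intern_effect_scar = helping[("scar", "intern")] - helping[("scar", "no intern")]
interaction = intern_effect_scar - intern_effect_no_scar

print("Main effect of scar:", scar_effect)        # -26.0 percentage points
print("Main effect of intern:", intern_effect)    # -14.0 percentage points
print("Intern effect without scar:", intern_effect_no_scar)  # -4
print("Intern effect with scar:", intern_effect_scar)        # -24
print("Interaction contrast:", interaction)       # -20
```

An interaction contrast of zero would mean the intern lowered helping by the same amount whether or not the victim had a scar. Here the drop is 20 percentage points larger when the scar is present, which is the pattern described in item 3.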

Consider a fictional example. Imagine we are interested in people’s perceptions of actors in different types of movies. We might predict that some actors are better suited to comedy and others are better suited to action movies. A simple experiment to test this hypothesis would show four movies in a 2 × 2 design, using the same two actors in two movies (for a total of four conditions). The first IV would be the movie type, with two levels: action and comedy. The second IV would be the actor, with two levels: Will Smith and Arnold Schwarzenegger. The dependent variable would be the ratings of each movie on a 10-point scale. This design produces three possible results:

1. The main effect of actor: Do people generally prefer Will Smith or Arnold Schwarzenegger, regardless of the movie?

2. The main effect of movie type: Do people generally prefer action or comedy movies, regardless of the actor?

3. The interaction between actor and movie type: Do people prefer each actor in a different kind of movie? (i.e., are ratings affected by the combination of actor and movie type?)

After collecting data from a sample of participants, we end up with the following average ratings for each movie, which Table 5.3 shows.

Table 5.3: Main effects and marginal means: the actor study

Remember that main effects represent the effects of one IV, averaging across the levels of the other IV. To average across levels, we calculate the marginal means, or the combined mean across levels of another factor. In other words, the marginal mean for action movies is calculated by averaging together the ratings of both Arnold Schwarzenegger and Will Smith in action movies. The marginal mean for Arnold Schwarzenegger is calculated by averaging together ratings of Arnold Schwarzenegger in both action and comedy movies. Performing these calculations for our 2 × 2 design results in four marginal means, which are presented alongside the participant ratings in Table 5.3. To verify these patterns would require statistical analyses, but it appears that people have a slight preference for comedy over action movies, as well as a slight preference for Arnold Schwarzenegger’s acting over Will Smith’s acting.

What about the interaction? The main hypothesis here posits that some actors perform better in some genres of movies (e.g., action or comedy) than they do in other genres, which suggests that the actor and the movie type have a combined effect on people’s ratings of the movies. Examining the means in Table 5.3 conveys a sense of this finding, but it is much easier to appreciate in a graph. Figure 5.7 shows the mean of participants’ ratings across the four conditions. If we focus first on the ratings of Arnold Schwarzenegger, we can see that participants did have a slight preference for him in action (6) versus comedy (5) roles. Then, examining ratings of Will Smith, we can see that participants had a strong preference for him in comedy (8) versus action (1.5) roles. Together, this set of means indicates an interaction between actor and movie type because the effect of one variable depends on the level of the other. In plain English: People’s perceptions of an actor depend on the type of movie in which he or she performs. This pattern of results nicely fits the hypothesis that certain actors are better suited to certain types of movie: Arnold should probably stick to action movies, and Will should definitely stick to comedies.
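The arithmetic behind marginal means and the interaction can be sketched in a few lines of Python. This is only an illustration: it uses the four cell means quoted in the text (6, 5, 1.5, and 8), assumes equal numbers of participants in every condition (so that averaging cell means gives the marginal means), and the variable and function names are invented for the example.

```python
# Cell means from the fictional actor study (ratings on a 10-point scale).
cell_means = {
    ("Arnold Schwarzenegger", "action"): 6.0,
    ("Arnold Schwarzenegger", "comedy"): 5.0,
    ("Will Smith", "action"): 1.5,
    ("Will Smith", "comedy"): 8.0,
}


def marginal_mean(factor_index, level):
    """Average the cell means at one level of a factor, across the other factor.

    Assumes equal cell sizes; with unequal cells a weighted mean is needed.
    """
    values = [m for key, m in cell_means.items() if key[factor_index] == level]
    return sum(values) / len(values)


# Main effects: compare the marginal means for each independent variable.
actor_means = {a: marginal_mean(0, a)
               for a in ("Arnold Schwarzenegger", "Will Smith")}
genre_means = {g: marginal_mean(1, g) for g in ("action", "comedy")}

# Interaction: does the action-minus-comedy difference change across actors?
arnold_effect = (cell_means[("Arnold Schwarzenegger", "action")]
                 - cell_means[("Arnold Schwarzenegger", "comedy")])
will_effect = (cell_means[("Will Smith", "action")]
               - cell_means[("Will Smith", "comedy")])
interaction_contrast = arnold_effect - will_effect

print(actor_means)
print(genre_means)
print(arnold_effect, will_effect, interaction_contrast)
```

A nonzero interaction contrast means the effect of movie type differs between the two actors. Whether that difference is statistically reliable still requires the analyses described later in the chapter.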

Figure 5.7: Interaction in the actor study

Before moving on to the logic of analyzing experiments, consider one more example from a published experiment. A large body of research in social psychology suggests that stereotypes can negatively affect performance on cognitive tasks (e.g., tests of math and verbal skills). According to Stanford social psychologist Claude Steele and his colleagues, individuals’ fear of confirming negative stereotypes about their group acts as a distraction. This distraction—which the researchers term stereotype threat—makes it hard to concentrate and perform well, and thus leads to lower scores on a cognitive test (Steele, 1997). One of the primary implications of this research is that ethnic differences in standardized-test scores can be viewed as a situational phenomenon—change the situation, and the differences go away. In the first published study of stereotype threat, Claude Steele and Josh Aronson (1995) found that when African-American students at Stanford were asked to indicate their race before taking a standardized test, this was enough to remind them of negative stereotypes, and they performed poorly. When the testing situation was changed, however, and participants were no longer asked their race, the students performed at the same level as Caucasian students. Worth emphasizing is that these were Stanford students and had therefore met admissions standards for one of the best universities in the nation. Even this group of elite students was susceptible to situational pressure but performed at their best when the pressure was eliminated.

In a great application of stereotype threat, social psychologist Jeff Stone at the University of Arizona asked both African-American and Caucasian college students to try their hands at putting on a golf course (Stone, Lynch, Sjomeling, & Darley, 1999). Putting was described as a test of natural athletic ability to half of the participants and as a test of sports intelligence to the other half. Thus, the experiment had two independent variables: the race of the participants (African-American or Caucasian) and the description of the task (“athletic ability” or “sports intelligence”). Note that “race” in this study is technically a quasi-independent variable because it is not manipulated. This design resulted in a total of four conditions, and the dependent variable was the number of putts that participants managed to make. Stone and colleagues hypothesized that describing the task as a test of athletic ability would lead Caucasian participants to worry about the stereotypes regarding their poor athletic ability. In contrast, describing the task as a test of intelligence would lead African-American participants to worry about the stereotypes regarding their lower intelligence.

Consistent with their hypotheses, Stone and colleagues found an interaction between race and task description but no main effects. That is, neither race was better at the putting task overall, and neither task description had an overall effect on putting performance. The combination of these variables, though, proved fascinating. When researchers described the task as measuring sports intelligence, the African-American participants did poorly due to fear of confirming negative stereotypes about their overall intelligence. Conversely, when researchers described the task as measuring natural athletic ability, the Caucasian participants did poorly due to fear of confirming negative stereotypes about their athleticism. This study beautifully illustrates an interaction; the effect of one variable (task description) depends on the level of another (race of participants). The results further confirm the power of the situation: Neither group did better or worse overall, but both were responsive to a situationally induced fear of confirming negative stereotypes.

Comstock Images/Stockbyte/Thinkstock

Skill on the golf course was used to study stereotypes in an experiment conducted by Jeff Stone at the University of Arizona.

Figure 5.8: Comparing sources of variance

5.5 Analyzing Data From Experiments

So far, we have been drawing conclusions about experimental findings using conceptual terms. But naturally, before we actually make a decision about the status of our hypotheses, we have to conduct statistical analyses. This section provides a conceptual overview of the most common statistical techniques for analyzing experimental data.

Dealing With Multiple Groups

Why do researchers need a special technique for experimental designs? After all, we learned in Chapter 2 (2.4) that we can compare a pair of means using a t test; why not use several t tests to analyze our experimental designs? For the movie ratings study, we could analyze the data using a total of six t tests to capture every possible pair of means:

Arnold Schwarzenegger in a comedy versus Will Smith in a comedy; Arnold Schwarzenegger in an action movie versus Will Smith in an action movie; Arnold Schwarzenegger in a comedy versus an action movie; Will Smith in a comedy versus an action movie; Will Smith in a comedy versus Arnold Schwarzenegger in an action movie; and finally Will Smith in an action movie versus Arnold Schwarzenegger in a comedy.

This approach, however, presents a problem. The odds of making a Type I error (getting excited about a false positive) increase with every statistical test. Researchers typically set their alpha level at 0.05 for a t test, meaning that they are comfortable with a 5% chance of a Type I error. Unfortunately, if we conduct six t tests, each one has a 5% chance of a Type I error, meaning that we have a greater chance of a false-positive result somewhere in the study. In short, we need a statistical approach that reduces the number of comparisons we perform. Fortunately, a statistical technique called the analysis of variance (ANOVA) tests for differences by comparing the amount of variance explained by the independent variables to the variance explained by error.
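To get a rough sense of how quickly the false-positive risk grows, the sketch below counts the pairwise comparisons among four conditions and estimates the chance of at least one Type I error. It assumes the tests are independent, which is not strictly true for t tests that share the same groups, so the result should be read as an approximation rather than an exact rate. The variable names are illustrative.

```python
from math import comb

alpha = 0.05          # per-test Type I error rate
n_conditions = 4      # the four cells of the 2 x 2 movie study

# Number of distinct pairs of condition means.
n_tests = comb(n_conditions, 2)

# Probability of at least one false positive across all tests,
# under the simplifying assumption that the tests are independent.
familywise_error = 1 - (1 - alpha) ** n_tests

print(n_tests)
print(round(familywise_error, 3))
```

Under this assumption, six tests at an alpha of 0.05 carry roughly a one-in-four chance of producing at least one false positive.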

The Logic of ANOVA

The logic behind an analysis of variance is rather straightforward. As the course has discussed throughout, variability in a dataset can be divided into systematic and error variance. That is, we can attribute some of the variability to the factors being studied, but a degree of random error will always be present. In our movie ratings study, some of the variability in these ratings can be attributed to the independent variables (differences in actors and movie types), while some of the variability is due to other factors—perhaps some people simply like movies more than other people.

The ANOVA works by comparing the influence of these different sources of variance. We always want to explain as much of the variance as possible through the independent variables. If the independent variables have more influence than random error does, this is good news. If, on the other hand, error variance has more influence than the independent variables, this is bad news for the hypotheses. Comparing the three pie charts in Figure 5.8 conveys a sense of this problem. The proportion of variance explained by our independent variables is shaded in tan, while the proportion explained by error is shaded in red. In the top graph, the independent variables explain approximately 80% of the variance, which we can view as a good result. In the middle graph, however, variance is explained equally by the independent variables and by error, and in the bottom graph, the independent variables explain only 20% of the variance. Thus, in the latter two graphs, the independent variables do no better than random error at explaining the results.

One more analogy may be helpful. In the field of engineering, the term signal-to-noise ratio is used to describe the amount of light, sound, energy, etc., that is detectable above and beyond background noise. This ratio is high when the signal comes through clearly and low when it is mixed with static or other interference. Likewise, when someone tries to tune in a favorite radio station, the goal is to find a clear signal that is not covered up by static. Believe it or not, the ANOVA statistic (symbolized F) is doing the same thing. That is, the analysis tells us whether differences in experimental conditions (signal) are detectable above and beyond error variance (noise).
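The signal-to-noise idea can be made concrete with a small sketch. For simplicity it computes F for a one-way design (a single independent variable with three groups) rather than the 2 × 2 factorial design of the movie study, and the ratings are made-up numbers chosen only to keep the arithmetic easy to follow. The function name is hypothetical.

```python
def one_way_f(groups):
    """Return the F ratio: between-groups mean square / within-groups mean square."""
    all_scores = [x for g in groups for x in g]
    grand_mean = sum(all_scores) / len(all_scores)
    group_means = [sum(g) / len(g) for g in groups]

    # Signal: how far the group means sit from the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    df_between = len(groups) - 1

    # Noise: how far individual scores sit from their own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)
    df_within = len(all_scores) - len(groups)

    return (ss_between / df_between) / (ss_within / df_within)


# Hypothetical ratings for three conditions (not data from any study).
ratings = [[1, 2, 3], [2, 3, 4], [6, 7, 8]]
f_ratio = one_way_f(ratings)
print(f_ratio)
```

A large F means the differences among group means are large relative to the spread of scores within groups. Judging significance still requires comparing F against the appropriate F distribution, which statistical software handles.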

Research: Thinking Critically

Love Ballad Leaves Women More Open to a Date

Follow the link below to a press release describing a 2010 study from the journal Psychology of Music. The study suggests that listening to love ballads may make women more likely to give their phone number to someone they have just met. As you read the article, consider what you have learned so far about the research process, and then respond to the questions below.

http://www.sciencedaily.com/releases/2010/06/100618112139.htm

Think About It

1. In this experiment, the type of song (love song or neutral song) is confounded with at least one other variable. Try to identify one. Do you think that this confounded variable would make a difference? How would you design a study that overcomes this?

2. Describe how demand characteristics might compromise the internal validity of this study. Can you think of any ways around this?

3. Toward the end of the article, the authors suggest that one explanation for these results is that the romantic music put the women into a more positive mood, and that this in turn made them more receptive to the men. How could you design a study that tests this hypothesis?

4. Given the nature of the DV in this study, would an ANOVA test be appropriate? What would be the more appropriate statistical test, and why?

Exploring the Data

Statistics courses cover ANOVA in more detail, but, despite its elegant simplicity, the test has a notable limitation. After conducting an ANOVA, we have a yes-or-no answer to the following question: Do our experimental groups have a systematic effect on the dependent variable? The answer lets us decide whether to reject the null hypothesis, but it does not tell us everything we want to know about the data. In essence, a significant F value tells us that the groups have a significant difference, but it does not tell us what the difference is. Conducting an ANOVA on our movie-ratings study would reveal a significant interaction between actor and movie, but we would need to take additional steps to determine the meaning of this interaction.

This section will describe the process of exploring and interpreting ANOVA results to make sense of the data. The example is drawn from a published study by Newman, Sellers, and Josephs (2005), which was designed to explore the effects of testosterone on cognitive performance. Previous research had suggested that testosterone was involved in two types of complex human behavior. On one hand, people with higher testosterone tend to perform better on tests of spatial skills, such as having to rotate objects mentally, and perform worse on tests of verbal skills, such as listing all the synonyms for a particular word. These patterns are thought to reflect the influence of testosterone on developing brain structures. On the other hand, people with higher testosterone are also more concerned with gaining and maintaining high status relative to other people. Testosterone correlates with a person’s position in the hierarchy and tends to rise and fall when people win and lose competitions, respectively. Sociologist Alan Mazur and his colleagues measured testosterone levels before, during, and after a series of professional chess matches. They found that testosterone rose in both players in anticipation of the competition, then rose even further in the winners, but plummeted in the losers (Mazur, Booth, & Dabbs, 1992).

Newman and colleagues (2005) set out to test the combination of these variables. Based on previous research, they hypothesized that people with higher testosterone would be uncomfortable when they were placed in a low-status position, leading them to perform worse on cognitive tasks. The researchers tested this hypothesis by randomly assigning people to a high status, low status, or control condition, and then administering a spatial and a verbal test. The resulting between-subjects design was a 2 (testosterone: high or low) × 3 (condition: high status, low status, control), for a total of six groups. Note that “testosterone” in this study is a quasi-independent variable, because it is measured rather than manipulated by the experimenters.

Once the results were in, the ANOVA revealed an interaction between testosterone and status but no main effects. Figure 5.9 shows the results of the study. These bars represent z scores that combine the spatial and verbal tests into one number. So, what do these numbers mean? How do we make sense out of the patterns? Doing so involves a combination of comparing means and calculating effect sizes, as we discuss next.

Figure 5.9: Exploring the data: Results from Newman et al. (2005)

Mean Comparisons

The first step in interpreting results is to compare the various pairs of means within the design. This might seem counterintuitive, since the whole point of the ANOVA was to test for effects without comparing individual means. Our goal, therefore, is to somehow explore differences in conditions without inflating Type I error rates. Achieving this balance involves two strategies.

Planned comparisons (also called a priori comparisons) involve comparing only the means for which differences were predicted by the hypothesis. In the experiment by Newman et al. (2005), the hypothesis explicitly stated that high-testosterone people should perform better in a high-status position than a low-status position. So, a planned comparison for this prediction would involve comparing two means with a t test: high T, high status (the highest red bar); and high T, low status (the lowest tan bar). Consistent with the researchers’ hypothesis, high-testosterone people did perform better on both tests, t(27) = 2.35, p = 0.01, but only in a high-status position. Type I errors are of less concern with planned comparisons because only a small number of theoretically driven comparisons are being conducted.

Referring to the graph of these results in Figure 5.9 and comparing high- with low-testosterone people reveals another interesting pattern: In a high-status position, high-testosterone people do better than low-testosterone people, but in a low-status position, this pattern is reversed, and high-testosterone people do worse. However, the researchers did not predict these mean comparisons, so treating them as planned contrasts would be cheating. Instead, they would use a second strategy called a post hoc comparison, which controls the overall alpha by taking into account the fact that multiple comparisons are being performed. In most cases, researchers conduct post hoc tests only if the overall F test is significant.

One popular way to conduct post hoc tests while minimizing the error rate is to use a technique called a Bonferroni correction. This technique, named after the Italian mathematician who developed it, involves simply adjusting the alpha level by the number of comparisons that are performed. For example, imagine we want to conduct 10 follow-up post hoc tests to explore the data. The Bonferroni correction would involve dividing the alpha level (0.05) by the number of comparisons (10), for a corrected alpha level of 0.005. Then, rather than using a cutoff of 0.05 for each test, we use this more conservative Bonferroni-corrected value of 0.005. Translation: Rather than accepting a Type I error rate of 5%, we are moving to a more conservative 0.5% cutoff to correct for the number of comparisons that we are performing.
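The correction described above amounts to a single division. The sketch below applies it to a set of hypothetical p values (invented for illustration, not taken from any study) to show how the stricter cutoff changes which comparisons count as significant.

```python
def bonferroni_alpha(alpha, n_comparisons):
    """Return the per-test alpha after a Bonferroni correction."""
    return alpha / n_comparisons


# Ten planned follow-up tests at an overall alpha of 0.05.
corrected_alpha = bonferroni_alpha(0.05, 10)

# Hypothetical p values from four of those follow-up tests.
p_values = [0.001, 0.004, 0.02, 0.04]

# A comparison is significant only if p falls below the corrected cutoff.
significant = [p < corrected_alpha for p in p_values]

print(corrected_alpha)
print(significant)
```

Note that the last two p values would pass the usual 0.05 cutoff but fail the corrected one, which is exactly the conservatism the correction is meant to add.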

A popular alternative to the Bonferroni correction is called Tukey’s HSD (for Honestly Significant Difference). This test works by calculating a critical value for mean comparisons (the HSD), and then using this critical value to evaluate whether mean comparisons are significantly different. The test manages to avoid inflating Type I error because the HSD is calculated based on the sample size, the number of experimental conditions, and the mean square within groups (MSWG), which essentially tests all the comparisons at once. In the study by Newman et al. (2005), both of these post hoc tests were significant: Compared to those low in testosterone, high-testosterone people did better in a high-status position but worse in a low-status position, suggesting that high testosterone magnifies the effect of testing situations on cognitive performance.
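For readers who want to see how those three ingredients combine, the standard textbook form of the critical value for equal group sizes is shown below. The symbols are notation introduced here for illustration: q is the critical value of the studentized range distribution (looked up for the chosen alpha, the number of conditions k, and the within-groups degrees of freedom), and n is the number of participants per condition.

```latex
\mathrm{HSD} = q_{\alpha,\,k,\,df_{WG}} \sqrt{\frac{MS_{WG}}{n}}
```

Any two condition means whose absolute difference exceeds the HSD are treated as significantly different.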

Effect Size

Statistical significance is only part of the story; researchers also want to know how big the effects of their independent variables are. Researchers can calculate effect size in several ways, but in general, bigger values mean a stronger effect. One of these statistics, Cohen’s d, is calculated as the difference between two means divided by their pooled standard deviation. The resulting values can therefore be expressed in terms of standard deviations; a d of 1 means that the two means are one standard deviation apart. How big should we expect our effects to be? Based on his analyses of typical effect sizes in the social sciences, Cohen suggested the following benchmarks: d = 0.20 is a small effect; d = 0.50 is a moderate effect; and d = 0.80 is a large effect. In other words, a large effect in the social and behavioral sciences corresponds to a difference of about four-fifths of a standard deviation.

In interpreting the results of their testosterone experiment, Newman and colleagues (2005) computed effect-size measurements for two of the key mean comparisons. First, they compared high-testosterone people in the high- and low-status conditions; the size of this effect was d = 0.78. Second, they compared the high- and low-testosterone people in the low-status condition; the size of this effect was d = 0.61. Based on Cohen’s benchmarks, the first effect sits at the threshold of the “large” range, and the second falls between “moderate” and “large.” More important, taken together with the mean comparisons, they help us to understand the way testosterone affects behavior. The authors conclude that cognitive performance stems from an interaction between biology (testosterone) and environment (assigned status) such that high-testosterone people are more responsive to their status in a given situation. When they are placed in a high-status position, they relax and perform well. Conversely, when placed in a low-status position, they become distracted and perform poorly. Researchers reach this nuanced conclusion only through an exploration of the data, using mean comparisons and effect-size measures.
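As a minimal sketch of the calculation, the code below computes Cohen’s d as the difference between two group means divided by the pooled standard deviation. The scores are hypothetical numbers chosen for easy arithmetic, not data from Newman et al. (2005), and the function name is invented for the example.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Difference between two means divided by the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled_var = (((n_a - 1) * variance(group_a)
                   + (n_b - 1) * variance(group_b))
                  / (n_a + n_b - 2))
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Hypothetical test scores for two conditions.
high_status = [4, 5, 6, 7, 8]
low_status = [2, 3, 4, 5, 6]

d = cohens_d(high_status, low_status)
print(round(d, 2))
```

Because d is expressed in standard deviation units, it can be compared across studies that use different measurement scales.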

5.6 Wrap-Up: Avoiding Error

As this final chapter concludes, it is worth thinking back to one of the key concepts in Chapter 2 (2.4): Type I and Type II errors. Regardless of the research question, the hypothesis, or the particulars of the research design, all studies have the goal of making accurate decisions about the hypotheses. That is, we need to be able to correctly reject the null hypothesis when it is false, and fail to reject the null when it is true. Still, from time to time and despite our best efforts, we make mistakes when we draw conclusions about our hypotheses, as Table 5.4 summarizes. A Type I error, or “false positive,” involves falsely rejecting a null hypothesis and becoming excited about an effect that is due to chance. A Type II error, or “false negative,” involves failing to reject the null hypothesis and missing an effect that is real and interesting. (For a refresher on these terms, refer back to Chapter 2.)

Table 5.4: Review of Type I and Type II errors

                 Researcher’s Decision
                 Reject Null         Fail to Reject Null
Null is FALSE    Correct Decision    Type II Error
Null is TRUE     Type I Error        Correct Decision

This section takes a problem-solving approach to minimizing both of these errors in an experimental context. It turns out that each error is primarily under the researcher’s control at different stages in the research process, which means reducing each error calls for different strategies.

Avoiding Type I Error

Type I errors occur when results are due to chance but are mistakenly interpreted as significant. We can generally reduce the odds of this happening by setting our alpha level at p < 0.05, meaning that we will only be excited about results that have less than a 5% chance of Type I error. However, Type I errors can still occur as a result of either extremely large samples or large numbers of statistical comparisons. Large samples can make small effects seem highly significant, so it is important to set a more conservative alpha level in large-scale studies. And, as this chapter has discussed, the odds of Type I error are compounded with each statistical test we conduct.

What this means is that Type I error is primarily under researchers’ control during statistical analysis—the smarter the statistics, the lower the odds of Type I error. This chapter has discussed several examples of “smart” statistics: Instead of conducting lots of t tests, we use an ANOVA to test for differences across the entire design simultaneously. Instead of conducting t tests to compare means after an ANOVA, we use a mix of planned contrasts (for comparisons that we predicted) and post hoc tests (for other comparisons we want to explore). More advanced statistical techniques take this a step further. For example, the multivariate analysis of variance (MANOVA) statistic analyzes sets of dependent variables to reduce further the number of individual tests. Researchers use this approach when dependent variables represent different measures of a related concept, such as using heart rate, blood pressure, and muscle tension to capture the stress response. The MANOVA works, broadly speaking, by computing a weighted sum of these separate DVs (called a canonical variable) and using this new variable as the dependent variable. To learn more about this and other advanced statistical techniques, see the excellent volume by James Stevens (2002), Applied Multivariate Statistics.

Avoiding Type II Error

Type II errors occur when a real underlying relationship exists between the variables, but the statistical tests are nonsignificant. The primary sources of this error are small samples and bad design. Small samples may fail to capture enough variability and may therefore lead to nonsignificant p values in testing an otherwise significant effect. Both large and small mistakes in experimental designs can add noise to the dataset, making it difficult to detect the real effects of independent variables.

This means that Type II error is primarily under the researcher’s control during the design process—the smarter the research designs, the lower the odds of Type II error. First, as Chapter 2 discussed, it is relatively simple to estimate the sample size needed for our research using a power calculator. These tools take basic information about the number of conditions in the research design and the estimated size of the effect and then estimate the number of people needed to detect this effect. (See Chapter 2, Figure 2.5, for an annotated example using one of these online calculators.)
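To illustrate the kind of calculation a power calculator performs, the sketch below estimates the number of participants needed per group to compare two group means. It is a simplification under stated assumptions: it uses the common normal-approximation formula for a two-tailed test, covers only a two-group comparison, and will give slightly smaller numbers than calculators based on the t distribution. The function name and default values are illustrative.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(effect_size_d, alpha=0.05, power=0.80):
    """Approximate per-group sample size for comparing two means.

    Normal-approximation formula: n = 2 * (z_alpha + z_power)^2 / d^2.
    """
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)  # two-tailed cutoff
    z_power = standard_normal.inv_cdf(power)          # desired power
    return ceil(2 * (z_alpha + z_power) ** 2 / effect_size_d ** 2)


# Smaller expected effects require many more participants.
for d in (0.2, 0.5, 0.8):
    print(d, n_per_group(d))
```

The pattern in the output is the practical point: halving the expected effect size roughly quadruples the sample needed to detect it.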

Second, as every chapter has discussed, it is the experimenter’s responsibility to take steps to minimize extraneous variables that might interfere with the hypothesis test. Whether researchers are conducting an observation, a survey study, or an experiment, the overall goal is to ensure that the variables of interest are the main cause of changes in the dependent variable. This is perhaps easiest in an experimental context because these designs are usually conducted in a controlled setting where the experimenter has control over the independent variables. Nonetheless, as the chapter discussed earlier, many factors can threaten the internal validity of an experiment—from confounds to sample bias to expectancy effects. In essence, the more we can control the influence of these extraneous variables, the more confidence we can have in the results of the hypothesis test.

Table 5.5 presents a summary of the information in this section, listing the primary sources of Type I and Type II errors, as well as the time period when these are under experimenter control.

Table 5.5: Summary—avoiding error

Error      Definition        Main Source                        When You Can Control
Type I     False-positive    Lots of tests; lots of people      Conducting stats
Type II    False-negative    Bad measures; not enough people    Designing experiments

Summary and Resources

Chapter Summary

This chapter focused on experimental designs, in which the primary goal is to explain behavior in causal terms. The chapter began with an overview of experimental terminology and the key features of experiments. Three key features distinguish experiments from other research designs. First, researchers manipulate a variable, giving them a fair amount of confidence that the independent variable (IV) causes changes in the dependent variable (DV). Second, researchers control the environment, ensuring that everything about the experimental context is the same for different groups of participants, except for the level of the independent variable. Finally, the researchers have the power to assign participants to conditions using random assignment. This process helps to ensure that preexisting differences among participants (e.g., in mood, motivation, intelligence, etc.) are balanced across the experimental conditions.

Next, the chapter explained the concept of experimental validity. When evaluating experiments, researchers must take into account both internal validity—or the extent to which the IV is the cause of changes in the DV—and external validity—or the extent to which the results generalize beyond the specific laboratory setting. Several factors can threaten internal validity, including experimental confounds, selection bias, and expectancy effects. The common thread among these threats is that they add noise to the hypothesis test and cast doubt on the direct connection between IV and DV. External validity involves two components, the realism of the study and the generalizability of the findings. Psychology experiments are designed to study real-world phenomena, but sometimes compromises have to be made to study these phenomena in the laboratory. Research often achieves this balance via mundane realism, or replicating the psychological conditions of the real phenomenon. Last, researchers have more confidence in the findings of a study when they can be replicated, or repeated in different settings with different measures.

In designing the nuts and bolts of experiments, researchers have to make decisions about both the nature and number of independent variables. First, designs can be described as between-subject, within-subject, or mixed. In a between-subject design, participants are in only one experimental condition and receive only one combination of the independent variables. In a within-subject design, participants are in all experimental conditions and receive all combinations of the independent variables. Finally, a mixed design contains a combination of between- and within-subject variables. In addition, research designs can be described as either one-way or factorial. One-way designs consist of only one IV with at least two levels; factorial designs consist of at least two IVs, each having at least two levels. A factorial design produces several results to examine: the main effect of each IV plus the interaction, or combination, of the IVs.

The chapter also discussed the logic of analyzing experimental data, using the analysis of variance (ANOVA) statistic. This test works by simultaneously comparing sources of variance and therefore avoids the risk of inflated Type I error. The ANOVA (or F) is calculated as a ratio of systematic variance to error variance, or, more specifically, of between-groups variance to within-groups variance. The bigger this ratio, the more experimental manipulations contribute to overall variability in scores. However, the F statistic suggests only that differences exist in the design; further analyses are necessary to explore these differences. The chapter described an example from a published study, discussing the process of comparing means and calculating effect sizes. In comparing means, researchers use a mix of planned contrasts (for comparisons that they predicted) and post hoc tests (for other comparisons they want to explore).

Finally, the chapter concluded by referring to two recurring concepts, Type I error (false positive) and Type II error (false negative). These errors interfere with the broad goal of making correct decisions about the status of a hypothesis. Thus, the purpose of this final section was to review ways to minimize errors. Type I errors are primarily inflated by large samples and lots of statistical analyses. Consequently, this error is under the experimenter’s control at the data-analysis stage. Type II errors are primarily inflated by small samples and flaws in the experimental design. Consequently, this error is under the experimenter’s control at the design and planning stage.

Key Terms

analysis of variance (ANOVA) A statistical procedure that tests for differences by comparing the variance explained by systematic factors to the variance explained by error.

between-subject design Experimental design in which each group of participants is exposed to only one level of the independent variable.

Bonferroni correction A post hoc test that involves dividing the alpha level by the number of comparisons to set a more conservative cutoff.

carryover effect Effects of one level are present when another level is introduced, making it difficult to separate the effects of different levels.

conceptual replication Testing the relationship between conceptual variables using new operational definitions.

condition One of the versions of an independent variable, forming different groups in the experiment; in a factorial design, refers to the groups formed by combinations of IVs.

confounding variable (or confound) A variable that changes systematically with the independent variable.

constructive replication Recreation of the original experiment that adds elements to the design; usually designed to rule out alternative explanations or extend knowledge about the variables under study.

control condition Group within the experiment that does not receive the experimental treatment.

counterbalancing Variation of the order of presentation among participants to reduce order effects.

cover story A misleading statement to participants about what is being studied to prevent effects of demand characteristics.

demand characteristic Cue in the study that leads participants to guess the hypothesis.

differential attrition Loss of participants, who drop out of experimental groups for different reasons.

environmental manipulation Changing some aspect of the experimental setting.

exact replication Recreation of the original experiment as closely as possible to verify the findings.

experimental condition Group within the experiment that receives a treatment designed to test a hypothesis.

experimental design Design whose primary goal is to explain causes of behavior.

experimenter expectancy Researchers see what they expect to see, leading to subtle bias in favor of their hypotheses; threat to internal validity.

external validity A metric that assesses generalizability of results beyond the specific conditions of the experiment.

extraneous variable Variable that adds noise to a hypothesis test.

factorial design A design that has two or more independent variables, each with two or more levels.

factorial notation A system for describing the number of variables and the number of levels in experimental designs.

fatigue effect Decline of participants’ performance as a result of repeated testing.

generalizability The extent to which results extend to other studies, using a wide variety of populations and of operational definitions.

instructional manipulation Changing the way a task is described to change participants’ mind-sets.

interaction The combined effect of variables in a factorial design; the effects of one IV are different depending on the levels of the other IV.

internal validity A metric that assesses the degree to which results can be attributed to independent variables.

invasive manipulation Taking measures to change internal, physiological processes; usually conducted in medical settings.

level Another way to describe the versions of an independent variable; describes the specific circumstances created by manipulating a variable.

main effect The effect of each independent variable on the dependent variable, collapsing across the levels of other variables.

marginal mean The combined mean of one factor across levels of another factor.

matched random assignment A variation on random assignment; ensures that an important variable is equally distributed between or among the groups; the experimenter obtains scores on an important matching variable, ranks participants on this variable, and then randomly assigns participants to conditions.

mixed design Experimental design that contains at least one between-subject variable and at least one within-subject variable.

multivariate analysis of variance (MANOVA) A statistic that analyzes sets of dependent variables to reduce the number of individual tests.

mundane realism Research that replicates the psychological conditions of the real-world phenomenon; criterion for judging external validity.

one-way design A design that has only one independent variable, with two or more levels to the variable.

order effect Moderation of the effects because of the order in which levels occur.

participant replication Repetition of the study with a new population of participants; usually driven by a compelling theory as to why the two populations differ.

placebo control Group added to a study to reduce placebo effects; mimics the experimental condition in every way but one.

placebo effect Change resulting from the mere expectation that change will occur.

planned comparison (or a priori comparison) Comparisons that involve comparing only the means for which differences were predicted by the hypothesis.

post hoc comparison Comparison that controls the overall alpha by taking into account that multiple comparisons are being performed; usually allowed only if the overall F test is significant.

practice effect Improvement of participants’ performance as a result of repeated testing.

quasi-independent variable Preexisting difference used to divide participants in an experimental context; referred to as “quasi” because variables are being measured, not manipulated, by the experimenter.

random assignment A technique for assigning participants to conditions; before participants arrive, the experimenter makes a random decision for each participant’s placement in a group.

replication Repetition of research results in different contexts and/or different laboratories.

selection bias Occurs when groups are different before the manipulation; problematic because preexisting differences might be the driving factor behind the results.

Tukey’s HSD (Honestly Significant Difference) A post hoc test that calculates a critical value for mean comparisons (the HSD) and then uses this critical value to evaluate whether mean comparisons are significantly different.

unrelated-experiments technique A strategy for preventing the effects of demand characteristics, leading participants to believe that they are completing two experiments during one session; experimenter can use this to present the independent variable during the first experiment and measure the dependent variable during the second experiment.

within-subject design Experimental design in which each group of participants is exposed to all levels of the independent variable.

Apply Your Knowledge

1. List and briefly describe the three distinguishing features of an experiment.

2. List the three types of expectancy effect that can affect experimental results, and name one way to avoid each type.

3. The following designs are described using factorial notation. For each one, state (a) the number of variables in the design, (b) the number of levels each variable has, and (c) the total number of experimental conditions.

3 × 3 × 3

2 × 3 × 4

4 × 4

2 × 2 × 2 × 2

4. Forty students were asked to rate two authors according to their knowledge of certain topic areas. Each student was given two passages to read. In one passage (“Brain”), the author discussed the roles of various brain structures in perceptual-motor coordination. In the second passage (“Motivation”), the author described ways to enhance motivation in preschool children. For half the students, both passages were written by male authors. For the other half of the students, both passages were written by a female author. After reading the passages, students rated the authors’ knowledge of their topic areas on a scale ranging from 1 (displays very little knowledge) to 10 (displays a thorough knowledge).

5.

              Male Author    Female Author
Brain              9               4
Motivation         6               7

(1) Identify the following information about the design: (2) Describe the design using factorial notation (e.g., 4 × 3). (3) Identify the total number of conditions. (4) Identify the design (circle one): between-subject within-subject mixed

6. For each of the following scenarios, identify what a Type I error and a Type II error would look like. Then, determine which type would be a bigger problem for that scenario.

a. A large international airport has received a bomb threat. In response, the airport police have tightened security and now check every piece of luggage manually. (1) Type I: (2) Type II: (3) Bigger problem:

b. Your friend purchases a pregnancy test. (1) Type I: (2) Type II: (3) Bigger problem:

Critical Thinking Questions

1. Explain the advantages and disadvantages of a within-subject design.

2. Compare and contrast the following terms. Your answers should demonstrate that you understand each term. Be sure to give some kind of context (e.g., “both are types of . . .”) or provide an example, and state how they are different.

a. internal versus external validity

b. between-subject versus within-subject design

c. level versus condition

3. Explain the difference between Type I and Type II errors. How can each type of error be minimized?
