Final Paper: Research Proposal
1/12/2018 Imprimir
https://content.ashford.edu/print/Newman.2681.16.1?sections=navpoint-32,navpoint-33,navpoint-34,navpoint-35,navpoint-36,navpoint-37,navpoint-38… 1/40
Learning Outcomes
By the end of this chapter, you should be able to:
Use appropriate terminology when discussing experimental designs.
Identify the key features of experiments for making causal statements.
Explain the importance of both internal and external validity in experiments.
Describe the threats to both internal and external validity in experiments.
Outline the most common types of experimental designs.
Describe methods for analyzing experimental data.
Summarize methods for avoiding Type I and Type II error.
5 Experimental Designs—Explaining Behavior
One of the oldest debates within psychology concerns the relative contributions of biology and the environment in shaping our thoughts, feelings, and behaviors. Do we become who we are because it is hard-wired into our DNA, or because of our early experiences? Do people share their parents’ personality quirks because they carry their parents’ genes, or because they grew up in their parents’ homes? Researchers can, in fact, address these types of questions in several ways. A consortium of researchers at the University of Minnesota has spent the past three decades comparing pairs of identical and fraternal twins, raised in the same versus different households, to tease apart the contributions of genes and environment. Read more at the research group’s website, http://mctfr.psych.umn.edu/.
An alternative way to separate genetic and environmental influence is through the use of experimental designs, which have the primary goal of explaining the causes of behavior. Recall from the design overview in Chapter 2 (2.1) that experiments can address causal relationships because the experimenter has control over the environment as well as over the manipulation of variables. One particularly ingenious example comes from the laboratory of Michael Meaney, a professor of psychiatry and neurology at McGill University. Meaney used female rats as experimental subjects (Francis, Diorio, Liu, & Meaney, 1999). His earlier research had revealed that the parenting ability of female rats could be reliably classified based on how attentive they were to their rat pups, as well as how much time they spent grooming the pups. The question tackled in the 1999 study was whether these behaviors were learned from the rats’ own mothers or transmitted genetically. To answer this question experimentally, Meaney and colleagues had to think very carefully about the comparisons they wanted to make. To simply compare the offspring of good and bad mothers would have been insufficient—this approach could not distinguish between genetic and environmental pathways.
Instead, Meaney decided to use a technique called cross-fostering, or switching rat pups from one mother to another as soon as they were born. The technique resulted in four combinations of rats: (1) those born to inattentive mothers but raised by attentive ones, (2) those born to attentive mothers but raised by inattentive ones, (3) those born and raised by attentive mothers, and (4) those born and raised by inattentive mothers. Meaney then tested the rat pups several months later and observed the way they behaved with their own offspring. Meaney’s control over all aspects of how the rat pups were raised was a critical element; he was able to keep everything the same except for the combination of their genetics and rearing environment. The setup of this experiment allowed Meaney to make clear comparisons between the influence of birth mothers and the rearing process. At the end of the study, the conclusion was crystal clear: Maternal behavior is all about the environment. Those rat pups that ultimately grew up to be inattentive mothers were those who had been raised by inattentive mothers.
This final chapter is dedicated to experimental designs, in which the primary goal is to explain behavior. Experimental designs rank highest on the continuum of control (see Figure 5.1) because the experimenter can manipulate variables, minimize extraneous variables, and assign participants to conditions. The chapter begins with an overview of the key features of experiments and then explains the importance of both internal and external validity of experiments. From there, the discussion moves to the process of designing and interpreting experiments and concludes with a summary of strategies for minimizing error in experiments.
Figure 5.1: Experimental designs on the continuum of control
5.1 Experiment Terminology
Before we dive into the details, it is important to cover the terminology that the chapter will use to describe different aspects of experimental designs. Much of this will be familiar from Chapter 2, with a few new additions. First, we will review the basics.
Recall that a variable is any factor that has more than one value. For example, height is a variable because people can be short, tall, or anywhere in between. Depression is a variable because people can experience a wide range of symptoms, from mild to severe. The independent variable (IV) is the variable that is manipulated by the experimenter to test hypotheses about cause. The dependent variable (DV) is the variable that is measured by the experimenter to assess the effects of the independent variable. For example, in an experiment testing the hypothesis that fear causes prejudice, fear would be the independent variable and prejudice would be the dependent variable. To keep these terms straight, it is helpful to think of the main goal of experimental designs. That is, we test hypotheses about cause by manipulating an independent variable and then looking for changes in a dependent variable. This means that we think the independent variable causes changes in the dependent variable; for example, we hypothesize that fear causes changes in prejudice.
When we manipulate an independent variable, we will always have two or more versions of the variable; this is what distinguishes experiments from, say, structured observational studies. One common way to describe the versions of the IV is in terms of different groups, or conditions. The most basic experiments have two conditions: The experimental condition receives a treatment designed to test the hypothesis, while the control condition does not receive this treatment. In the fear and prejudice example above, the participants who make up the experimental condition would be made to feel afraid, while the participants who make up the control condition would not. This setup allows us to test whether introducing fear to one group of participants leads them to express more prejudice than the other group of participants, who are not made fearful.
Another common way to describe these versions is in terms of levels of the independent variable. Levels describe the specific set of circumstances created by manipulating a variable. For example, in the fear and prejudice experiment, the variable of fear would have two levels—afraid and not afraid. We have countless ways to operationalize fear in this experiment. One option would be to adopt the technique used by the social psychologist Stanley Schachter (1959), who led participants to believe they would be exposed to a series of painful electric shocks. In Schachter’s study, the painful shocks never happened, but they did induce a fearful state as people anticipated them. So, those at the “afraid” level of the independent variable might be told to expect these shocks, while those at the “not afraid” level of the independent variable would not be given this expectation.
At this stage, having two sets of vocabulary terms—“levels” and “conditions”—for the same concept may seem odd. However, with advanced experimental designs using multiple independent variables, there is a subtle difference in how these terms are used. As the designs become more complex, it is often necessary to expand IVs to include several groups and multiple variables. At that point, researchers need different terminology to distinguish between the versions of one variable and the combinations of multiple variables. The chapter will later return to this complexity, in the section “Experimental Design.”
5.2 Key Features of Experiments
The overview of designs in Chapter 2 described the overall process of experiments in the following way: Researchers control the environment as much as possible so that all participants have the same experience. The researchers then manipulate, or change, one key variable, and then measure the outcomes in another key variable. This section examines this process in more detail. Experiments can be distinguished from all other designs by three key features: manipulating variables, controlling the environment, and assigning people to groups.
Manipulating Variables
The most crucial element of an experiment is the researcher’s manipulation, or change, of some key variable. To study the effects of hunger, for example, a researcher could manipulate the amount of food given to the participants, or to study the effects of temperature, the experimenter could raise and lower the temperature of the thermostat in the laboratory. In both cases, recall that the researcher needs a way to operationalize the concepts (hunger and temperature) into measurable variables. For example, the experimenter could define “hungry” as being deprived of food for eight hours, and define a “hot” room as being 90 degrees Fahrenheit. Because these factors are under the direct control of the experimenters, they can feel more confident that changing them contributes to changes in the dependent variables.
Chapter 2 discussed the main shortcoming of correlational research: These designs do not allow researchers to make causal statements. Recall from that chapter (as well as from Chapter 4) that correlational research is designed to predict one variable from another. One of the examples in Chapter 2 concerned the correlation between income levels and happiness, with the goal of trying to predict happiness levels based on knowing people’s income level. If we measure these as they occur in the real world, we cannot say for sure which variable causes the other. However, we could settle this question relatively quickly with the right experiment. Suppose we bring two groups into the laboratory and give one group $100 and a second group nothing. If the first group is happier at the end of the study, it would support the idea that money really does buy happiness. Of course, this experiment is a rather simplistic look at the connection between money and happiness. Even so, because we manipulate levels of money, this study would bring us closer to making causal statements about the effects of money.
To manipulate variables, it is necessary to have at least two versions of the variable. That is, to study the effects of money, we need a comparison group that does not receive money. To study the effects of hunger, we would need both a hungry and a not-hungry group. Having two versions of the variable distinguishes experimental designs from the structured observations discussed in Chapter 3 (3.4), in which all participants receive the same set of conditions in the laboratory. Even the most basic experiment must have two sets of conditions, which are often an experimental group and a control group. However, as this chapter will later explain, experiments can become much more complex. A study might have one experimental group and two control groups, or five degrees of food deprivation, ranging from 0 to 12 hours without food. Decisions about the number and nature of these groups will depend on consideration of both the hypotheses and previous literature.
Researchers have three options for manipulating variables. First, environmental manipulations involve changing some aspect of the setting. Environmental manipulations are perhaps the most common in psychology studies, and they include everything from varying the room temperature to varying the amount of money people receive. The key is to change the way that different groups of people experience their time in the laboratory—it is either hot or cold, and they either receive or do not receive $100.
Having a patient run on a treadmill to measure cardiovascular stress is an example of invasive manipulation.
Second, instructional manipulations involve changing the way a task is described to change participants’ mindsets. For example, a researcher might give the same math test to all participants but describe it to one group as an “intelligence test” and to another group as a “problem-solving task.” Because an intelligence test is thought to have implications for life success, the experimenter might expect participants in that group to be more nervous about their scores.
Finally, an invasive manipulation involves taking measures to change internal, physiological processes; it is usually conducted in medical settings. For example, studies of new drugs involve administering the drug to volunteers to determine whether it has an effect on some physical or psychological symptom. Alternatively, studies of cardiovascular health often involve having participants run on a treadmill to measure how the heart functions under stress.
The rule that we must manipulate a variable has one qualification. In many experiments, researchers divide participants based on a preexisting difference (e.g., gender) or personality measures (e.g., self-esteem or neuroticism) that capture stable individual differences among people. The idea behind these personality measures is that someone scoring high on a measure of neuroticism (for example) would be expected to be more neurotic across situations than someone scoring lower on the measure. Using this technique allows a researcher to compare how, for example, men and women or people with high and low self-esteem respond to manipulations.
When researchers use preexisting differences in an experimental context, they are referred to as quasi-independent variables—“quasi,” or “nearly,” because they are being measured, not manipulated, by the experimenter, and thus do not meet the criteria for a regular independent variable. In fact, variables used in this way are things that cannot be manipulated by an experimenter—either for practical or ethical reasons—including gender, race, age, eye color, religion, and so forth. Instead, these are treated as independent variables in that participants are divided into groups along these variables (e.g., male versus female; Catholic versus Protestant versus Muslim).
Because these variables are not manipulated, an experimenter cannot make causal statements about them. For a study to count as an experiment, these quasi-independent variables would have to be combined with a true independent variable. This could be as simple as comparing how men and women respond to a new antidepressant drug—gender would be quasi-independent while drug type would be a true independent variable.
Sometimes the line between true and quasi-experiments can be subtle. Imagine we want to study how winning versus losing a contest affects people’s persistence at a second task. In a quasi-experimental approach, we could have two participants play a game, resulting in a natural winner and loser, and then compare how long each one stuck with the next game. The approach’s limitation is that some preexisting condition might have affected winning and losing the first game. Perhaps the winners had more self-confidence and patience at the start. However, we could improve the design to be a true experiment by having participants play a rigged game against a confederate, thereby causing participants either to win or lose. In this case, we would be manipulating winning and losing, and preexisting differences would be averaged out across the groups (more on this later in the chapter).
Controlling the Environment
The second important element of experimental designs is the researcher’s high degree of control over the environment. In addition to manipulating variables, an experimenter has to ensure that the other aspects of the environment are the same for all participants. For instance, if we were interested in the effects of temperature on people’s mood, we could manipulate temperature levels in the laboratory so that some people experienced warmer temperatures and other people cooler temperatures. However, it is equally important to make sure that other potential influences on mood are the same for both groups. That is, we would want to make sure that the “warm” and “cool” groups were tested in the same room, around the same time of day, and by similar experimenters.
The overall goal, then, is to control extraneous variables, or variables that add noise to the hypothesis test. In essence, the more researchers can control extraneous variables, the more confidence they can have in the results of the hypothesis test. As the section “Validity and Control” will discuss, these extraneous variables can have different degrees of impact on a study. Imagine we conduct the study on temperature and mood, and all of our participants are in a windowless room with a flickering fluorescent light. This environment would likely influence people’s mood—making everyone a little bit grumpy—but it causes fewer problems for our hypothesis test because it affects everyone equally. Table 5.1 shows hypothetical data from two variations of this study, using a 10-point scale to measure mood ratings. In the top row, participants were in a well-lit room; notice that participants in the cooler room reported being in a better mood (i.e., an 8 versus a 5). In the bottom row, all participants were in the windowless room with flickering lights. These numbers suggest that people were still in a better mood in the cool room (5) than in the warm room (2), but the flickering fluorescent light had a constant dampening effect on everyone’s mood.
Table 5.1: Influence of an extraneous variable
                                      Cool Room   Warm Room
Variation 1: Well-Lit                 8           5
Variation 2: Flickering Fluorescent   5           2
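The arithmetic behind Table 5.1 can be checked with a short sketch. The dictionary layout and function name below are illustrative choices, not part of the original study; only the four mood ratings come from the table.

```python
# Mood ratings from Table 5.1 (10-point scale).
mood = {
    "well_lit": {"cool": 8, "warm": 5},
    "flickering": {"cool": 5, "warm": 2},
}

def temperature_effect(ratings):
    """Cool-room mood minus warm-room mood within one lighting variation."""
    return ratings["cool"] - ratings["warm"]

# Effect of temperature, computed separately for each lighting variation.
effects = {lighting: temperature_effect(r) for lighting, r in mood.items()}

# Shift in mood produced by the lighting itself, computed separately per room.
lighting_shift = {
    room: mood["flickering"][room] - mood["well_lit"][room]
    for room in ("cool", "warm")
}

print(effects)         # {'well_lit': 3, 'flickering': 3}
print(lighting_shift)  # {'cool': -3, 'warm': -3}
```

The temperature effect is the same 3 points in both variations, while the lighting lowers every rating by the same amount. That is what it means for an extraneous variable to affect everyone equally: it shifts the scores without changing the comparison between conditions.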
Assigning People to Conditions
The third key feature of experimental designs is that the researcher can assign people to receive different conditions, or versions, of the independent variable. This is an important piece of the experimental process: Experimenters not only control the options—warm versus cool room, $100 versus no money, etc.—but they also control which participants get each option. Whereas a correlational design might assess the relationship between current mood and choosing the warm room, an experimental design will assign some participants to the warm room and then measure the effects on their mood. In other words, experimenters are able to make causal statements because they cause things to happen to a particular group of people.
The most common, and most preferable, way to assign people to conditions is through a process called random assignment. An experimenter who uses random assignment makes a separate decision for each participant, before the participant arrives, as to which group he or she will be assigned to. As the term implies, this decision is made randomly—by flipping a coin, using a random number table (for an example, see http://stattrek.com/tables/random.aspx), or drawing numbers out of an envelope. Some researchers simply alternate back and forth between experimental conditions, although strict alternation is systematic rather than truly random. The overall goal is to try to balance preexisting differences among people, as Figure 5.2 illustrates. So, for example, some people might generally be more comfortable in warm rooms, while others might be more comfortable in cold rooms. If each person who shows up for the study has an equal chance of being in either group, then each group should reflect roughly the same distribution of these differences as the sample as a whole.
Figure 5.2: Random assignment
The 24 participants in our sample consist of a mix of happy and sad people. The goal of random assignment is to have these differences distributed equally across the experimental conditions. Thus, the two groups on the right each consist of six happy and six sad people, and our random assignment was successful.
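As a rough illustration of random assignment, the sketch below shuffles a hypothetical sample of 24 participants and deals them into two conditions. The function name, the participant IDs, and the 12 happy/12 sad split are assumptions made for the example. Shuffling and dealing is one common variant that guarantees equal group sizes, which a separate coin flip per participant would not.

```python
import random

def randomly_assign(participants, conditions=("warm", "cool"), seed=None):
    """Shuffle participants, then deal them out so group sizes stay equal."""
    rng = random.Random(seed)  # a seed makes the assignment reproducible
    shuffled = list(participants)
    rng.shuffle(shuffled)
    groups = {c: [] for c in conditions}
    for i, person in enumerate(shuffled):
        groups[conditions[i % len(conditions)]].append(person)
    return groups

# Hypothetical sample: 12 happy and 12 sad participants.
sample = [("P%02d" % i, "happy" if i < 12 else "sad") for i in range(24)]
groups = randomly_assign(sample, seed=1)

for name, members in groups.items():
    happy = sum(1 for _, mood in members if mood == "happy")
    print(name, len(members), "participants,", happy, "happy")
```

Random assignment does not guarantee a perfect six-and-six split in any single study. It only makes large imbalances unlikely, and less likely still as the sample grows.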
Forming groups through random assignment also has the significant advantage of helping to avoid bias in the selection and assignment of subjects. For example, it would be a bad idea to assign people to groups based on a first impression of them because participants might be placed in the cold room if they arrived at the laboratory dressed in warm clothing. Experimenters who make decisions about condition assignments ahead of time can be more confident that the independent variable is responsible for changes in the dependent variable.
Worth highlighting here is the difference between random selection and random assignment. Random selection means that the sample of participants is chosen at random from the population, as with the probability sampling methods discussed in Chapter 4. However, most psychology experiments use a convenience sample of individuals who volunteer to complete the study. This means that the sample is often far from fully random. Even so, a researcher can still make sure that the study involves random assignment to groups, so that each condition contains an equal representation of the sample.
In some cases—most notably, when samples are small—random assignment may not be sufficient to balance an important characteristic that might affect the results of a particular study. Imagine conducting a study that compared two strategies for teaching students complex math skills. In this example, it would be especially important to make sure that both groups contained a mix of individuals with, say, average and above-average intelligence. For this reason, the experimenter would need to take extra steps to ensure that intelligence was equally distributed between the groups, which can be accomplished with a variation on random assignment called matched random assignment. This kind of assignment requires the experimenter to obtain scores on an important matching variable—in this case, intelligence—rank participants based on the matching variable, and then randomly assign people to conditions. Figure 5.3 shows how this process would unfold in our math-skills study. First, the researcher gives participants an IQ test to measure preexisting differences in intelligence. Second, the experimenter ranks participants based on these scores, from highest to lowest. Third, the experimenter moves down this list in order and randomly assigns each participant to one of the conditions. This process still contains an element of random assignment, but adding the extra step of rank ordering ensures a more balanced distribution of intelligence test scores across the conditions.
Figure 5.3: Matched random assignment
The 20 participants in our sample represent a mix of very high, average, and very low intelligence test scores (measured 1–100). The goal of matched random assignment is to ensure that this variation is distributed equally across the two conditions. The experimenter would first rank participants by intelligence test scores (top box), and then distribute these participants alternately between the conditions. The end result is that both groups (lower boxes) contain a good mix of high, average, and low scores.
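The rank-then-assign procedure can be sketched as follows. This is one plausible reading of the steps, assuming that participants are ranked, grouped into consecutive pairs, and that a random draw decides which member of each pair goes to which condition. The function name, condition labels, and generated scores are hypothetical.

```python
import random

def matched_random_assignment(scores, conditions=("method_a", "method_b"), seed=None):
    """Rank participants by the matching variable, then randomize within
    consecutive blocks so each condition gets one member of every block."""
    rng = random.Random(seed)
    ranked = sorted(scores, key=scores.get, reverse=True)  # highest score first
    groups = {c: [] for c in conditions}
    k = len(conditions)
    for start in range(0, len(ranked), k):
        block = ranked[start:start + k]
        order = list(conditions)
        rng.shuffle(order)  # the random step: who in the block gets which condition
        for person, condition in zip(block, order):
            groups[condition].append(person)
    return groups

# Hypothetical intelligence test scores (1-100) for 20 participants.
score_rng = random.Random(0)
scores = {"P%02d" % i: score_rng.randint(1, 100) for i in range(20)}

groups = matched_random_assignment(scores, seed=42)
for condition, members in groups.items():
    mean = sum(scores[p] for p in members) / len(members)
    print(condition, len(members), "participants, mean score", round(mean, 1))
```

Because every adjacent pair in the ranking is split across the two conditions, the groups cannot differ much in average score, yet no participant's condition is predictable in advance.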
5.3 Experimental Validity
Chapter 2 discussed the concept of validity, or the degree to which the measures used in a study capture the constructs that they were designed to capture. That is, a measure of happiness needs to capture differences in people’s levels of happiness. This section returns to the subject of validity in an experimental context, assessing whether experimental results demonstrate the causal relationships that researchers think they are demonstrating. We will discuss two types of validity that are relevant to experimental designs. The first is internal validity, which assesses the degree to which results can actually be attributed to the independent variables. The second is external validity, which assesses how well the results generalize to situations beyond the specific conditions laid out in the experiment. Taken together, internal and external validity provide a way to assess the merits of an experiment. However, each kind has its own threats and remedies, as the following sections explain.
Internal Validity
To have a high degree of internal validity, experimenters strive for maximum control over extraneous variables. That is, they try to design experiments so that the independent variable is the only cause of differences between groups. But, of course, no study is ever perfect, and some degree of error is always in place. In many cases, errors are the result of unavoidable random causes, such as the health or mood of the participants on the day of the experiment. In other cases, errors are due to factors that are, in fact, within the experimenter’s control. This section focuses on several of these more manageable threats to internal validity and discusses strategies for reducing their influence.
Experimental Confounds
To avoid threats to the internal validity of an experiment, it is important to control and minimize the influence of extraneous variables that might add noise to a hypothesis test. In many cases, extraneous variables can be considered relatively minor nuisances, as when the mood experiment was inadvertently run in a depressing room. Now, though, suppose we conduct our study on temperature and mood, and due to a lack of careful planning, accidentally place all of the “warm room” participants in a sunny room, and the “cool room” participants in a windowless room. We might very well find that the warm-room participants are in a much better mood. Still, is this the result of warm temperatures or the result of exposure to sunshine? Unfortunately, we would be unable to tell the difference because of a confounding variable (or confound)—a variable that changes systematically with the independent variable. In this example, room lighting is confounded with room temperature because all of the warm-room participants are also exposed to sunshine, and all of the cool-room participants are not. This confounding combination of variables leaves us unable to determine which variable actually has the effect on mood. In other words, because our groups differ in more than one way, we cannot clearly say that the independent variable of interest (room temperature) caused the dependent variable (mood) to change.
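A small numerical sketch can show why a confound makes the result uninterpretable. It assumes a simple additive model of mood, which is an illustrative simplification and not something claimed by the studies discussed here. All function names and effect sizes are hypothetical.

```python
def observed_mood(base, temp_effect, light_effect, warm, sunny):
    """Mood under a simple additive model (an illustrative assumption)."""
    return base + temp_effect * warm + light_effect * sunny

def confounded_difference(temp_effect, light_effect, base=5):
    """Warm-and-sunny group minus cool-and-windowless group."""
    warm_group = observed_mood(base, temp_effect, light_effect, warm=1, sunny=1)
    cool_group = observed_mood(base, temp_effect, light_effect, warm=0, sunny=0)
    return warm_group - cool_group

def controlled_difference(temp_effect, light_effect, base=5):
    """Both groups tested in the same sunny room: lighting is held constant."""
    warm_group = observed_mood(base, temp_effect, light_effect, warm=1, sunny=1)
    cool_group = observed_mood(base, temp_effect, light_effect, warm=0, sunny=1)
    return warm_group - cool_group

# Three different "true" worlds produce the same observed difference.
print(confounded_difference(temp_effect=3, light_effect=0))   # 3
print(confounded_difference(temp_effect=0, light_effect=3))   # 3
print(confounded_difference(temp_effect=-1, light_effect=4))  # 3

# Holding lighting constant recovers the temperature effect alone.
print(controlled_difference(temp_effect=-1, light_effect=4))  # -1
```

In the confounded design, a world where warmth helps, a world where only sunshine helps, and a world where warmth actually hurts all yield the same 3-point advantage for the warm room. Only when lighting is the same for both groups does the observed difference reflect temperature by itself.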
This observation may seem oversimplified, but the way to avoid confounds is to be very careful in designing experiments. By ensuring that groups are alike in every way but the experimental condition, an experimenter can generally prevent confounds. Nevertheless, avoiding confounds is somewhat easier said than done because they can come from unexpected places. For example, most studies involve the use of multiple research assistants who manage data collection and interact with participants. Some of these assistants might be more or less friendly than others, so it is important to make sure each of them interacts with participants in all conditions. If the friendliest assistant always ran participants in the warm-room group, for example, the assistant’s friendliness would be confounded with room temperature. Consequently, the experimenter would be unable to separate the influence of the independent variable (the room) from that of the confound (the research assistant).
Friendliness of the research assistant is a variable that can affect the outcome of an experiment.
Selection Bias
Internal validity can also be threatened when groups differ before the manipulation, a condition known as selection bias. Selection bias causes problems because these preexisting differences might be the driving factor behind the results. Imagine someone is investigating a new program that will help people stop smoking. The experimenter might decide to ask for volunteers who are ready to quit smoking and put them through a six-week program. But by asking for volunteers—a remarkably common error—the researcher gathers a group of people who are already somewhat motivated to stop smoking. Thus, it is difficult to separate the effects of the new program from the effects of this preexisting motivation.
One easy way to avoid this problem is through either random or matched random assignment. In the stop-smoking example, a researcher could still ask for volunteers, but then randomly assign these volunteers either to the new program or to a control group. Because both groups would consist of people motivated to quit smoking, the effects of motivation would cancel out. Another way to minimize selection bias is to use the same people in both conditions so that they serve as their own control. This technique is known as a within-subject design. In the stop-smoking example, the experimenter could assign volunteers first to one program and then to the other. However, this approach might present a problem: Participants who successfully quit smoking in the first program would not benefit from the second program. We will discuss the advantages and disadvantages of this design in the section “Within-Subject Designs.”
Differential Attrition
Despite researchers’ best efforts at random assignment, they could still have a biased sample at the end of a study as a result of differential attrition. The problem of differential attrition occurs when subjects drop out of experimental groups for different reasons. Say we are conducting a study of the effects of exercise on depression levels. We manage to randomly assign people either to one week of regular exercise or to one week of regular therapy. At first glance, it appears that the exercise group shows a dramatic drop in depression symptoms. But then we notice that about one-third of the people in this group dropped out before completing the study. Chances are we are left with the participants who are most motivated to exercise, to overcome their depression, or both. Thus, it is difficult to isolate the effects of the independent variable on depression symptoms. Although we cannot prevent people from dropping out of our study, we can look carefully at those who do. In many cases, researchers can spot a pattern and use it to guide future research. For example, it may be possible to create a profile of people who dropped out of the exercise study and use this knowledge to increase retention for the next attempt.
Outside Events
As much as experimenters strive to control the laboratory environment, participants are often influenced by events in the outside world. These events, sometimes called history effects, are often large scale and include political upheavals and natural disasters. History effects threaten research because they make it difficult to tell whether participants’ responses are due to the independent variable or to the historical event(s). A paper published by social psychologist Ryan Brown, now a professor at the University of Oklahoma, offers a remarkable example. Brown et al.’s paper discussed the effects of receiving different types of affirmative action as people were selected for a leadership position. The goal was to determine the best way to frame affirmative action to avoid undermining the recipient’s confidence (Brown, Charnsangavej, Keough, Newman, & Rentfrow, 2000). For about a week during the data-collection process, students at the University of Texas, where the study was being conducted, protested on the school’s main lawn about a controversial lawsuit regarding affirmative-action policies. One side effect of these protests was that participants arriving for Brown’s study had to pass through a swarm of people holding signs that either denounced or supported affirmative action. These types of outside events are difficult, if not impossible, to control. But because these researchers were aware of the protests, they decided to exclude the data gathered during the week of the protests, thus minimizing the effects of outside events.
Expectancy Effects
One final set of threats to internal validity results from the influence of expectancies on people’s behavior. This influence can cause trouble for experimental designs in three related ways. First, experimenter expectancies can cause researchers to see what they expect to see, leading to subtle bias in favor of their hypotheses. In a clever demonstration of this phenomenon, the psychologist Robert Rosenthal asked his graduate students at Harvard University to train groups of rats to run a maze (Rosenthal & Fode, 1963). He also told them that based on a pretest, the rats had been classified as either bright or dull. As might be surmised, these labels were pure fiction, but they still influenced the way that the students treated the rats. Those labeled bright were given more encouragement and learned the maze much more quickly than rats labeled dull. Rosenthal later extended this line of work to teachers’ expectations of their students (Rosenthal & Jacobson, 1992) and found support for the same conclusion: People often bring about the results they expect by behaving in a particular way.
One common way to avoid experimenter expectancies is to have participants interact with a researcher who is “blind” to (i.e., unaware of) the condition to which each participant has been assigned. Blind researchers may be fully aware of the general research hypothesis, but their behavior is less likely to affect the results if they are unaware of the specific conditions. In the Rosenthal and Fode (1963) study, the graduate students’ behavior influenced the rats’ learning speed only because they were aware of the labels bright and dull. Had the students not known these labels, the rats would have been treated fairly equally across the conditions.
Second, participants in a research study often behave differently based on their own expectancies about the goals of the study. These expectancies often develop in response to demand characteristics, or cues in the study that lead participants to guess the hypothesis. In a well-known study conducted at the University of Wisconsin, psychologists Leonard Berkowitz and Anthony LePage (1967) found that participants would behave more aggressively (by delivering electric shocks to another participant) if a gun was in the room than if no gun was present. This finding has some clear implications for gun-control policies, suggesting that the mere presence of guns increases the likelihood of gun violence. However, a common critique of this study contends that participants may have quickly clued in to its purpose and figured out how they were “supposed” to behave. That is, the gun served as a demand characteristic, possibly making participants act more aggressively because they thought the researchers expected them to do so.
To minimize demand characteristics, researchers use a variety of techniques, all of which attempt to hide the true purpose of the study from participants. One common strategy is to use a cover story, or a misleading statement about what is being studied. Chapter 1 (1.3) discussed Milgram’s famous obedience studies, which discovered that people were willing to obey orders to deliver dangerous levels of electric shocks to other people. To disguise the purpose of the study, Milgram described it to participants as a study of punishment and learning. To give another example, Ryan Brown and colleagues (2000) presented their affirmative-action study as a study of leadership styles. These cover stories aimed to give participants a compelling explanation for what they experienced during the study and to direct their attention away from the research hypothesis.
Another strategy for avoiding demand characteristics is to use the unrelated-experiments technique, which leads participants to believe that they are completing two different experiments during one laboratory session. The experimenter can use this bit of deception to present the independent variable during the first experiment and then measure the dependent variable during the second experiment. For example, a study by Harvard psychologist Margaret Shih and colleagues (Shih, Pittinsky, & Ambady, 1999) recruited Asian-American females and asked them to complete two supposedly unrelated studies. In the first, they were asked to read and form impressions of one of two magazine articles; these articles were designed to make them focus on either their Asian-American identity or their female identity. In the second experiment, they were asked to complete a math test as quickly as possible. The goal of this study was to examine the effects of priming different aspects of identity on math performance. Based on previous research, these authors predicted that priming an Asian-American identity would remind participants of positive stereotypes regarding Asians and math performance, whereas priming a female identity would remind participants of negative stereotypes regarding women and math performance. As the researchers expected, priming an Asian-American identity led this group of participants to do better on a math test than did priming a female identity.
A placebo control can test whether alcohol itself affects behavior, or whether people merely expect it to and change their behavior based on those expectations.
The unrelated-experiments technique was especially useful for this study because it kept participants from connecting the independent variable (magazine article prime) with the dependent variable (math test).
A final way in which expectancies shape behavior is the placebo effect, meaning that change can result from the mere expectation that change will occur. Imagine we want to test the hypothesis that alcohol causes people to become aggressive. One relatively easy way to do this would be to give alcohol to a group of volunteers (aged 21 and older) and then measure how aggressively they behave in response to being provoked. The problem with this approach is that people also expect alcohol to change their behavior, and so we might see changes in aggression simply because of these expectations. Fortunately, the problem has an easy solution: add a placebo control group to the study that mimics the experimental condition in every way but one. In this case, we might tell all participants that they will be drinking a mix of vodka and orange juice but only add vodka to half of the participants’ drinks. The orange-juice-only group serves as our placebo control. Any differences between this group and the alcohol group can be attributed to the alcohol itself.
External Validity
To have a high degree of external validity in experiments, researchers strive for maximum realism in the laboratory environment. External validity means that the results extend beyond the particular set of circumstances created in a single study. Recall that science is a cumulative discipline and that knowledge grows one study at a time. Thus, each study is more meaningful: 1) to the extent that it sheds light on a real phenomenon; and 2) to the extent that the results generalize to other studies. This section examines each of these criteria separately.
Mundane Realism
The first component of external validity is the extent to which an experiment captures the real-world phenomenon under study. Inspired by a string of school shootings in the 1990s, one popular question in the area of aggression research asks whether rejection by a peer group leads to aggression. That is, when people are rejected from a group, do they lash out and behave aggressively toward the members of that group? Researchers must find realistic ways to manipulate rejection and measure aggression without infringing on participants’ welfare. Given the need to strike this balance, how real can conditions be in the laboratory? How do we study real-world phenomena without sacrificing internal validity?
The answer is to strive for mundane realism, meaning that the research replicates the psychological conditions of the real-world phenomenon (sometimes referred to as ecological validity). In other words, we need not recreate the phenomenon down to the last detail; instead, we aim to make the laboratory setting feel like the real-world phenomenon. Researchers studying aggressive behavior and rejection have developed some rather clever ways of doing this, including allowing participants to administer loud noise blasts or serve large quantities of hot sauce to those who reject them. Psychologically, these acts feel like aggressive revenge because participants are able to lash out against those who rejected them—with the intent of causing harm—even though the behaviors themselves may differ from the ways people exact revenge in the real world.
In a 1996 study, Tara MacDonald and her colleagues at Queen’s University in Ontario, Canada, examined the relationship between alcohol and condom use (MacDonald, Zanna, & Fong, 1996). The authors were intrigued by a puzzling set of real-world data: Most people self-reported that they would use condoms when engaging in casual sex, but actual rates of unprotected sex (i.e., having sexual intercourse without a condom) were also remarkably high. In this study, the authors found that alcohol was a key factor in causing “common sense to go out the window” (p. 763), resulting in a decreased likelihood of condom use. But how on earth might they study this phenomenon in the laboratory? In the authors’ words, “even the most ambitious of scientists would have to conclude that it is impossible to observe the effects of intoxication on actual condom use in a controlled laboratory setting” (p. 765).
To solve this dilemma, MacDonald and colleagues developed a clever technique for studying people’s intentions to use condoms. Participants were randomly assigned to either an alcohol or placebo condition, and then they viewed a video depicting a young couple faced with the dilemma of whether to have unprotected sex. At the key decision point in the video, the tape was stopped and participants were asked what they would do in the situation. As predicted, participants who were randomly assigned to consume alcohol said they would be more willing to proceed with unprotected sex. While this laboratory study does not capture the full experience of making decisions about casual sex, it does a pretty nice job of capturing the psychological conditions involved.
Generalizing Results
The second component of external validity, generalizability, refers to the extent to which the results extend to other studies that use a wide variety of populations and a wide variety of operational definitions (sometimes referred to as population validity). If we conclude that rejection causes people to become more aggressive, for example, this conclusion should ideally carry over to other studies of the same phenomenon, studies that use different ways of manipulating rejection and different ways of measuring aggression. If we want to conclude that alcohol reduces the intention to use condoms, we would need to test this relationship in a variety of settings, from laboratories to nightclubs, using different measures of intentions.
Thus, each single study researchers conduct is limited in its conclusions. For a particular idea to take hold in the scientific literature, it must be replicated, or repeated in different contexts. Replication can take one of four forms. First, exact replication involves trying to recreate the original experiment as closely as possible to verify the findings. This type of replication is often the first step following a surprising result, and it helps researchers to gain more confidence in the patterns.
The second and much more common method, conceptual replication, involves testing the relationship between conceptual variables using new operational definitions. Conceptual replications would include testing aggression hypotheses using new measures or examining the link between alcohol and condom use in different settings. For example, rejection might be operationalized in one study by having participants be chosen last for a group project. A conceptual replication might take a different approach, operationalizing rejection by having participants be ignored during a group conversation or voted out of the group. Likewise, a conceptual replication might change the operationalization of aggression, with one study measuring the delivery of loud blasts of noise and another measuring the amount of hot sauce that people give to their rejecters. Each variation studies the same concept (aggression or rejection) but uses slightly different operationalizations. If all of these variations yield similar results, this further supports the underlying idea, which in this case is that rejection causes people to be more aggressive.
The third method, participant replication, involves repeating the study with a new population of participants. These types of replication are usually driven by a compelling theory as to why the two populations differ. For example, we might reasonably hypothesize that the decision to use condoms is guided by a different set of considerations among college students than among older, single adults. Or, we might hypothesize that different cultures around the world might have different responses to being rejected from a group.
Finally, constructive replication re-creates the original experiment but adds elements to the design. These additions are typically designed to either rule out alternative explanations or extend knowledge about the variables under study. Our rejection and aggression example might compare the impact of being rejected by a group versus by an individual.
Internal Versus External Validity
This chapter has focused on two ways to assess validity in the context of experimental designs. Internal validity assesses the degree to which results can be attributed to independent variables; external validity assesses how well results generalize beyond the specific conditions of the experiment. In an ideal world, studies would have a high degree of both of these. That is, we would feel completely confident that our independent variable was the only cause of differences in our dependent variable, and our experimental paradigm would perfectly capture the real-world phenomenon under study.
Reality, though, often demands a trade-off between internal and external validity. In MacDonald et al.’s (1996) study on condom use, the researchers sacrificed some realism in order to conduct a tightly controlled study of participants’ intentions. In Berkowitz and LePage’s (1967) study on the effect of weapons, the researchers risked the presence of a demand characteristic in order to study reactions to actual weapons. These types of trade-offs are always made based on the goals of the experiment.
Research: Applying Concepts
Balancing Internal Versus External Validity
To give you a better sense of how researchers make compromises between internal and external validity, consider the following fictional scenarios.
Scenario 1—Time Pressure and Stereotyping
Dr. Bob is interested in whether people are more likely to rely on stereotypes when they are in a hurry. In a well-controlled laboratory experiment, he asks participants to categorize ambiguous shapes as either squares or circles, and half of these participants are given a short time limit to accomplish the task. The independent variable is the presence or absence of time pressure, and the dependent variable is the extent to which people use stereotypes in their classification of ambiguous shapes. Dr. Bob hypothesizes that people will be more likely to use stereotypes when they are in a hurry because they will have fewer cognitive resources to consider carefully all aspects of the situation. Dr. Bob takes great care to have all participants meet in the same room. He uses the same research assistant every time, and the study is always conducted in the morning. Consistent with his hypothesis, Dr. Bob finds that people seem to use shape stereotypes more under time pressure.
The internal validity of this study appears high: Dr. Bob has controlled for other influences on participants’ attention span by collecting all of his data in the morning. He has also minimized error variance by using the same room and the same research assistant. In addition, Dr. Bob has created a tightly controlled study of stereotyping through the use of circles and squares. Had he used photographs of people (rather than shapes), the attractiveness of these people might have influenced participants’ judgments. The study, however, has a trade-off: By studying the social phenomenon of stereotyping using geometric shapes, Dr. Bob has removed the social element of the study, thereby posing a threat to mundane realism. The psychological meaning of stereotyping shapes is rather different from the meaning of stereotyping people, which makes this study relatively low in external validity.
Scenario 2—Hunger and Mood
Dr. Jen is interested in the effects of hunger on mood; not surprisingly, she predicts that people will be happier when they are well fed. She tests this hypothesis with a lengthy laboratory experiment, requiring participants to be confined to a laboratory room for 12 hours with very few distractions. Participants have access to a small pile of magazines to help pass the time. Half of the participants are allowed to eat during this time, and the other half are deprived of food for the full 12 hours. Dr. Jen, a naturally friendly person, collects data from the food-deprivation group on a Saturday afternoon, while her grumpy research assistant, Mike, collects data from the well-fed group on a Monday morning. Her independent variable is food deprivation, with participants either not deprived of food or deprived for 12 hours. Her dependent variable consists of participants’ self-reported mood ratings. When Dr. Jen analyzes the data, she is shocked to discover that participants in the food-deprivation group are much happier than those in the well-fed group.
Compared to our first scenario, this study seems high on external validity. To test her predictions about food deprivation, Dr. Jen actually deprives her participants of food. One possible problem with external validity is that participants are confined to a laboratory setting during the deprivation period with only a small pile of magazines to read. That is, participants may be more affected by hunger when they do not have other things to distract them. In the real world, people are often hungry but distracted by paying attention to work, family, or leisure activities. Dr. Jen, though, has sacrificed some external validity for the sake of controlling how participants spend their time during the deprivation period. The larger problem with her study has to do with internal validity. Dr. Jen has accidentally confounded two additional variables with her independent variable: Participants in the deprivation group have a different experimenter and data are collected at a different time of day. Thus, Dr. Jen’s surprising results most likely reflect that everyone is in a better mood on Saturday than on Monday and that Dr. Jen is more pleasant to spend 12 hours with than Mike is.
Scenario 3—Math Tutoring and Graduation Rates
Dr. Liz is interested in whether specialized math tutoring can help increase graduation rates among female math majors. To test her hypothesis, she solicits female volunteers for a math-skills workshop by placing fliers around campus, as well as by sending email announcements to all math majors. The independent variable is whether participants are in the math-skills workshop, and the dependent variable is whether participants graduate with a math degree. Those who volunteer for the workshop are given weekly skills tutoring, along with informal discussion groups designed to provide encouragement and increase motivation. At the end of the study, Dr. Liz is pleased to see that participants in the workshops are twice as likely as nonparticipants to stick with the major and graduate.
The obvious strength of this study is its external validity. Dr. Liz has provided math tutoring to math majors, and she has observed a difference in graduation rates. Thus, this study is very much embedded in the real world. However, this external validity comes at a cost to internal validity. The study’s biggest flaw is that Dr. Liz has recruited volunteers for her workshops, resulting in selection bias for her sample. People who volunteer for extra math tutoring are likely to be more invested in completing their degree and might also have more time available to dedicate to their education. Dr. Liz would also need to be mindful of how many people drop out of her study. If significant numbers of participants withdraw, she could have a problem with differential attrition, so that only the most motivated people stay with the workshops. Dr. Liz can fix this study with relative ease by asking for volunteers more generally and then randomly assigning these volunteers to take part in either the math tutoring workshops or a different type of workshop. While the sample might still be less than random, Dr. Liz would at least have the power to assign participants to different groups.
5.4 Experimental Design
The process of designing experiments boils down to deciding what to manipulate and how to do it. This section covers two broad issues related to experimental design: deciding how to structure the levels, or different versions of an independent variable, and deciding on the number of independent variables necessary to test the hypotheses. While these decisions may seem tedious, they are at the crux of designing successful experiments, and are, therefore, the key to performing successful tests of hypotheses.
Levels of the Independent Variable
The primary goal in designing experiments is to ensure that the levels of independent variables are equivalent in every way but one. This is what allows researchers to make causal statements about the effects of that single change. These levels can be formed in one of two broad ways: representing two distinct groups of people or representing the same group of people over time.
Between-Subject Designs
In most of the examples discussed so far, the levels of independent variables have represented two distinct groups: participants are in either the control group or the experimental group. This type of design is referred to as a between-subject design because the levels differ between one subject and the next. Each participant who enrolls in the experiment is exposed to only one level of the independent variable (for example, either the experimental or the control group). Most of the examples so far have been illustrations of between-subject designs: participants receive either alcohol or a placebo; students read an article designed to prime either their Asian or their female identity; and graduate students train rats that are falsely labeled either bright or dull. The “either-or” between-subject approach is common and has the advantage of using distinct groups to represent each level of the independent variable. In other words, participants who are asked to consume alcohol are completely distinct from those asked to consume the placebo drink. However, the between-subject approach is only one option for structuring the levels of the independent variable. This section examines two additional ways to structure these levels.
Within-Subject Designs
In some cases, the levels of the independent variable can represent the same participants at different time periods. This type of design is referred to as a within-subject design because the levels differ within individual participants. Each participant who enrolls in the experiment would be exposed to all levels of the independent variable. That is, every participant would be in both the experimental and the control group. Within-subject designs are often used to compare changes over time in response to various stimuli. For example, a researcher might measure anxiety symptoms before and after people are locked in a room with a spider, or measure depression symptoms before and after people undergo drug treatment.
Within-subject designs have two main advantages over between-subject designs. First, because the same people constitute both levels of the IV, these designs require fewer participants. Suppose we decide to collect data from 20 participants at each level of an IV. In a between-subject design with three levels, we would need 60 people. However, if we run the same experiment as a within-subject design—exposing the same group of people to three different sets of circumstances—we would need only 20 people. Thus, within-subject designs are often a good way to conserve resources.
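The arithmetic behind this saving is simple enough to write down. The following Python sketch uses a hypothetical helper, `participants_needed`, to reproduce the 60-versus-20 comparison from the text.

```python
def participants_needed(per_level, n_levels, design):
    """Number of people required to get `per_level` observations at every level."""
    if design == "between":
        # Each person sees only one level, so every level needs its own group.
        return per_level * n_levels
    if design == "within":
        # The same people are measured at every level.
        return per_level
    raise ValueError("design must be 'between' or 'within'")

print(participants_needed(20, 3, "between"))  # 60
print(participants_needed(20, 3, "within"))   # 20
```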
Second, participants also serve as their own control group, allowing the researcher to minimize a major source of error variance. Remember that one key feature of experimental design is the researcher’s power to assign people to groups to distribute subject differences randomly across the levels of the IV. Using a within-subject design solves the problem of subject differences in another way, by examining changes within people. For instance, in the study of spiders and anxiety, some participants are likely to have higher baseline anxiety than others. By measuring changes in anxiety in the same group of people before and after spider exposure, we are able to minimize the effects of individual differences.
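To see why measuring change within people sidesteps baseline differences, consider the small Python sketch below. The anxiety ratings are invented for illustration and are not data from any study.

```python
from statistics import mean

# Hypothetical anxiety ratings (higher = more anxious) for three participants.
before = {"p1": 2, "p2": 5, "p3": 8}   # baselines differ widely across people
after = {"p1": 4, "p2": 7, "p3": 10}   # ratings after spider exposure

# Each person's change score uses that same person as the comparison point.
changes = {person: after[person] - before[person] for person in before}

print("baseline range:", max(before.values()) - min(before.values()))
print("change scores:", changes)
print("mean change:", mean(changes.values()))
```

In this made-up example the baselines span six points, yet every participant shows the same two-point increase, so the individual differences drop out of the comparison.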
Carryover effects can be understood through the example of monitoring people’s reactions to different film clips. How they feel about one image may influence how they react to the next image.
Disadvantages of Within-Subject Designs
Within-subject designs also have two clear disadvantages compared to between-subject designs. First, they pose the risk of carryover effects, in which the effects of one level are still present when another level is introduced. Because the same people are exposed to all levels of the IV, it can be difficult to separate the effects of one level from the effects of the others. One common paradigm in emotion research is to show participants several film clips that elicit different types of emotion. People might view one clip showing a puppy playing with a blanket, another showing a child crying, and another showing a surgical amputation. Even without seeing these clips in full color, we can imagine that it would be hard to shake off the disgust triggered by the amputation to experience the joy triggered by the puppy.
When researchers use a within-subject design, they take steps to minimize carryover effects. In studies of emotion, for example, researchers typically show a brief neutral clip, like waves rolling onto a beach, after each emotional clip, so that participants experience each emotion after viewing a benign image. Another simple technique is to collect data from the baseline control condition first whenever possible. In the study of spiders and anxiety, it would be important to measure baseline anxiety at the start of the experiment before exposing people to spiders. Once people have been surprised by a spider, it will be hard to get them to relax enough to collect control ratings of anxiety.
Second, within-subject designs risk order effects, meaning that the order in which levels are presented can moderate their effects. Order effects fall into two categories. The practice effect happens when participants’ performance improves over time simply due to repeated attempts. This is a particular problem in studies that examine learning. Say we use a within-subject design to compare two techniques for teaching people to solve logic problems. Participants would learn technique A, then take a logic test, then learn technique B, and then take a second logic test. The possible problem is that participants will have had more opportunities to practice logic problems by the time they take the second test. This makes it difficult to separate the effects of practicing the logic problems from the effects of using different teaching techniques.
The flip side of the practice effect is the fatigue effect, which happens when participants’ performance decreases over time due to repeated testing. Imagine running a variation of the above experiment, teaching people different ways to improve their reaction time. Participants might learn each technique and have their reaction time tested several times after each one. The problem is that people gradually start to tire, and their reaction times slow down due to fatigue. Thus, it would be difficult to separate the effects of fatigue from the effects of the different teaching techniques.
Both types of order effects confound the order of presentation with the level of the independent variable. Fortunately, researchers have a relatively easy way to control for both practice and fatigue effects: a process called counterbalancing. Counterbalancing involves varying the order of presentation across groups of participants. The simplest approach is to divide participants into as many groups as there are possible orders of the levels in the experiment. That is, we create a group for each possible order, allowing us to identify the effects of encountering the conditions in different orders. In the examples above, the learning experiments involved two techniques, A and B. To counterbalance these techniques across the study, we divide the participants into two groups. We expose one group to A and then B; we expose the other group to B and then A. When it is time to analyze the data, we will be able to examine the effects of both presentation order and teaching technique. If the order of presentation made a difference, then the A/B group would differ from the B/A group in some way.
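Complete counterbalancing of this kind can be sketched in Python by enumerating every possible order and spreading participants evenly across them. The `counterbalance` helper and the participant labels are hypothetical; shuffling before assignment is an added assumption so that who receives which order is left to chance.

```python
import random
from itertools import permutations

def counterbalance(participants, levels, seed=None):
    """Assign each participant one presentation order, cycling through all orders."""
    orders = list(permutations(levels))      # every possible order of the levels
    shuffled = list(participants)
    random.Random(seed).shuffle(shuffled)    # randomize who gets which order
    return {
        person: orders[index % len(orders)]
        for index, person in enumerate(shuffled)
    }

participants = [f"participant_{i}" for i in range(1, 9)]
assignment = counterbalance(participants, levels=("A", "B"), seed=1)

order_counts = {}
for order in assignment.values():
    order_counts[order] = order_counts.get(order, 0) + 1
print(order_counts)
```

With two levels there are only two orders, so the eight participants split four and four. The number of orders grows quickly, though: three levels already require six groups.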
Mixed Designs
The third common way to structure the levels of an IV is using a mixed design, which contains at least one between-subject variable and at least one within-subject variable. So, in the previous example, participants would be exposed to both teaching techniques (A and B) but in only one order of presentation. In this case, teaching technique is a within-subject variable because participants experience both levels, and presentation order is a between-subject variable because participants experience only one level. Because we have one of each in the overall experiment, it is a mixed design.
Studies that compare the effects of different drugs commonly use mixed designs. Imagine we want to compare three treatments (Drug X, Drug Y, and a placebo control) to determine which has the strongest effect on reducing depression symptoms. To perform this study, we would want to measure depression symptoms on at least three occasions: before starting drug treatment, after a few months of taking the drug, and then again a few months after stopping the drug (to assess relapse rates). So, our participants would be given one of the three treatments and then measured at each of three time periods. In this mixed design, measurement time is a within-subject variable because participants are measured at all possible times, while the drug is a between-subject variable because participants experience only one of the three treatments.
Figure 5.4 shows the hypothetical results of this study. Observe that the placebo pill has no effect on depression symptoms; depression scores in this group are the same at all three measurements. Drug X appears to cause significant improvement in depression symptoms; depression scores drop steadily across measurements in this group. Strangely, Drug Y seems to make depression worse; depression scores increase steadily across measurements in this group. The mixed design allows us both to track people over time and to compare different drugs in one study.
Figure 5.4: Example of a mixed-subjects design
Research: Thinking Critically
Outwalking Depression
Follow the link below to an article from Psychology Today, describing a 2011 research study from the Journal of Psychiatric Research. The study provides new evidence of the benefits of exercise for people with depression. As you read the article, consider what you have learned so far about the research process, and then respond to the questions below.
https://www.psychologytoday.com/blog/exercise-and-mood/201107/outwalking-depression
Think About It
1. Identify the following essential aspects of this experimental design:
a) What are the IV and DV in this study?
b) How many levels does the IV have?
c) Is this a between-subjects, within-subjects, or mixed design?
d) Draw a simple table labeling each condition.
2. a) What preexisting differences between groups should the researchers be sure to take into account? Name as many as you can.
b) How should the researchers assign participants to the conditions in order to ensure that preexisting differences cannot account for the results?
3. How might expectancy effects influence the results of this study? Can you think of any ways to control for this?
4. Briefly state how you would replicate this study in each of the following ways:
a) exact replication
b) conceptual replication
c) participant replication
d) constructive replication
One-Way Versus Factorial Designs
The second big issue in creating experimental designs is to decide how many independent variables to manipulate. In some cases, we can test our hypotheses by manipulating a single IV and measuring the outcome—such as giving people either alcohol or a placebo drink and measuring the intention to use condoms. In other cases, hypotheses involve more complex combinations of variables. Earlier, the chapter discussed research findings that people tend to act more aggressively after a peer group has rejected them—a single independent variable. Researchers could, however, extend this study and ask what happens when people are rejected by members of the same sex versus members of the opposite sex. We could go one step further and test whether the attractiveness of the rejecters matters, for a total of three independent variables. These examples illustrate two broad categories of experimental design, known as one-way and factorial designs.
One-Way Designs
If a study involves assigning people to either an experimental or control group and measuring outcomes, it has a one-way design, that is, a design with only one independent variable that has two or more levels. These tend to be the simplest experiments and have the advantage of testing manipulations in isolation. The majority of drug studies use one-way designs. These types of study compare the effects on medical outcomes for people randomly assigned, for instance, to take the antidepressant drug Prozac or a placebo. Note that a one-way design can still have multiple levels—in many cases it is preferable to test several different doses of a drug. So, for example, we might test the effects of Prozac by assigning people to take doses of 5 mg, 10 mg, 20 mg, or a placebo control. The independent variable would be the drug dose, and the dependent variable would be a change in depression symptoms. This one-way design would allow us to compare all three of the drug doses to a placebo control, as well as to test the effects of varying doses of the drug. Figure 5.5 shows hypothetical results from this study. We can see that even those receiving the placebo showed a drop in depression symptoms, with the 10-mg dose of Prozac producing the maximum benefit.
Figure 5.5: Comparing drug doses in a one-way design
Factorial Designs
Despite the appealing simplicity of one-way designs, experiments conducted in the field of psychology with only one IV are relatively rare. The real world is much more complicated, so studies that focus on people’s thoughts, feelings, and behaviors must somehow capture this complexity. Thus, the rejection-and-aggression example above is not that farfetched. If a researcher wanted to manipulate the occurrence of rejection, the gender of the rejecters, and the attractiveness of the rejecters in a single study, the experiment would have a factorial design. Factorial designs are those that have two or more independent variables, each of which has two or more levels. When experimenters use a factorial design, their purpose is to observe both the effects of individual variables and the combined effects of multiple variables.
Factorial designs have their own terminology to reflect the fact that they include both individual variables and combinations of variables. The beginning of this chapter explained that the versions of an independent variable are referred to as both levels and conditions, with a subtle difference between the two. This difference becomes relevant to the discussion of factorial designs. Specifically, levels refer to the versions of each IV, while conditions refer to the groups formed by combinations of IVs. Consider one variation of the rejection-and-aggression example from this perspective: The first IV has two levels because participants are either rejected or not rejected. The second IV also has two levels because members of the same sex or the opposite sex do the rejecting. To determine the number of conditions in this study, we calculate the number of different experiences that participants can have in the study. This is a simple matter of multiplying the levels of separate variables, so two multiplied by two, for a total of four conditions.
Researchers also have a way to quickly describe the number of variables in their design: A two-way design has two independent variables; a three-way design has three independent variables; an eight-way design has eight independent variables, and so on. Even more useful, the system of factorial notation offers a simple way to describe both the number of variables and the number of levels in experimental designs. For instance, we might describe our design as a 2 × 2 (pronounced “two by two”), which instantly communicates two things: (1) the study uses two independent variables, indicated by the presence of two separate numbers and (2) each IV has two levels, indicated by the number 2 listed for each one.
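The arithmetic behind levels, conditions, and factorial notation can be made concrete with a short Python sketch. The variable names and level labels below are our own shorthand for the rejection-and-aggression example, not terms from the original studies.

```python
from itertools import product
from math import prod

# Each independent variable (IV) maps to its list of levels.
design = {
    "rejection": ["rejected", "not rejected"],
    "sex of rejecters": ["same sex", "opposite sex"],
}

levels = [len(options) for options in design.values()]
notation = " x ".join(str(n) for n in levels)   # factorial notation, e.g. "2 x 2"
conditions = list(product(*design.values()))    # every combination of levels

print(f"{len(design)}-way design, {notation}, {prod(levels)} conditions")
for condition in conditions:
    print(condition)
```

Adding a third IV with two levels (such as the attractiveness of the rejecters) would turn this into a 2 x 2 x 2 design with eight conditions, which shows how quickly factorial designs grow.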
The 2 × 2 Design
One of the most common factorial designs also happens to be the simplest one—the 2 × 2 design. As noted above, these designs have two independent variables, with two levels each, for a total of four experimental conditions. The simplicity of these designs makes them a useful way to become more comfortable with some of the basic concepts of experiments. This section will explore an example of a 2 × 2 and analyze it in detail.
Beginning in the late 1960s, social psychologists developed a keen interest in understanding the predictors of helping behavior. This interest was inspired, in large part, by the tragedy of Kitty Genovese, who was killed outside her apartment building while none of her neighbors called the police (Gansberg, 1964). As Chapter 2 (2.1) discussed, in one representative study, Princeton psychologists John Darley and Bibb Latané examined people’s likelihood of responding to a staged emergency. Participants were led to believe that they were taking part in a group discussion over an intercom system, but in reality, all of the other participants were prerecorded. The key independent variable was the number of other people supposedly present, ranging from two to six. A few minutes into the conversation, one participant appeared to have a seizure. The recording went like this (actual transcript; Darley & Latané, 1968):
I could really-er-use some help so if someone would-er-give me a little h-hel-puh-er-er-er c-could somebody er-er-hel-er-uh-uh-uh [choking sounds] . . . I’m gonna die-er-er-I’m . . . gonna die-er-hel-er-er- seizure-er [chokes, then quiet].
What do people do in this situation? Do they help? How long does it take? Darley and Latané discovered that two things happened as the group became larger: People were less likely to help at all, and those who did help took considerably longer to do so. Researchers concluded from this and other studies that people are less likely to help when other people are present because the responsibility for helping is “diffused” among the members of the crowd (Darley & Latané, 1968).
Building on this earlier conclusion, the sociologist Jane Piliavin and her colleagues (Piliavin, Piliavin, & Rodin, 1975) explored the influence of two additional variables on helping behavior. The experimenters staged an emergency on a New York City subway train in which a person who was in on the study appeared to collapse in pain. Piliavin and her team manipulated two variables in their staged emergency. The first independent variable was the presence or absence of a nearby medical intern, who could be easily identified in blue scrubs. The second independent variable was the presence or absence of a large disfiguring scar on the victim’s face. The combination of these variables resulted in four conditions, as Table 5.2 shows. The dependent variable in this study was the percentage of people taking action to help the confederate.
Table 5.2: 2 × 2 Design of the Piliavin et al. study
             No intern     Intern
No scar      Condition 1   Condition 2
Scar         Condition 3   Condition 4
The authors predicted that bystanders would be less likely to help if a perceived medical professional was nearby since he or she was considered more qualified to help the victim. They also predicted that people would be less likely to help when the confederate had a large scar because previous research had demonstrated convincingly that people avoid contact with those who are disfigured or have other stigmatizing conditions (e.g., Goffman, 1963). As Figure 5.6 reveals, the results supported these hypotheses. Both the presence of a scar and the presence of a perceived medical professional reduced the percentage of people who came to help. Nevertheless, something else is apparent in these results: When the confederate was not scarred, having an intern nearby led to a small decrease in helping (from 88% to 84%). However, when the confederate had a large facial scar, having an intern nearby decreased helping from 72% to 48%. In other words, it seems these variables are having a combined effect on helping behavior. The next section examines these combined effects more closely.
Figure 5.6: Sample 2 × 2 design: Results from Piliavin et al. (1975)
Piliavin et al. (1975)
Main Effects and Interactions
When experiments involve only one independent variable, the analyses can be as simple as comparing two group means—as did the example in Chapter 1, which compared the happiness levels of couples with and without children. But what about cases where the design has more than one independent variable?
A factorial design has two types of effects: A main effect refers to the effect of each independent variable on the dependent variable, averaging values across the levels of other variables. A 2 × 2 design has two main effects; a 2 × 2 × 2 design has three main effects because there are three IVs. An interaction occurs when the variables have a combined effect; that is, the effects of one IV are different depending on the levels of the other IV. Applying this new terminology to the Piliavin et al. (1975) “subway emergency” study produces three possible results (“possible,” because we would need to use statistical analyses to verify them):
1. The main effect of scar: Does the presence of a scar affect helping behavior? Yes. More people help in the absence of a facial scar. Figure 5.6 indicates that the bars on the left (no scar) are, on average, higher than those on the right (scar).
2. The main effect of intern: Does the presence of an intern affect helping behavior? Yes. More people help when no medical intern is on hand. Note that in Figure 5.6, the red bars (no intern) are, on average, higher than the tan bars (intern).
3. The interaction between scar and intern: Does the effect of one variable depend on the level of the other variable? Yes. Refer to Figure 5.6 and observe that the presence of a medical intern matters more when the victim has a facial scar. In visual terms, the gap between red and tan bars is much larger in the bars on the right. This indicates an interaction between scar and intern.
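The three effects listed above can be read directly off the four percentages reported for the subway study (88%, 84%, 72%, and 48%). The following Python sketch does the arithmetic; it is descriptive only, the helper name `marginal` is our own, and, as noted above, statistical tests would still be needed to confirm each effect.

```python
# Percentage of bystanders helping in each condition, taken from the text above.
helping = {
    ("no scar", "no intern"): 88,
    ("no scar", "intern"): 84,
    ("scar", "no intern"): 72,
    ("scar", "intern"): 48,
}

def marginal(factor_index, level):
    """Average helping across all conditions that share one level of one IV."""
    values = [pct for cond, pct in helping.items() if cond[factor_index] == level]
    return sum(values) / len(values)

# Main effects: difference between the marginal means of each IV.
scar_effect = marginal(0, "scar") - marginal(0, "no scar")
intern_effect = marginal(1, "intern") - marginal(1, "no intern")

# Interaction: does the intern effect differ depending on the scar?
intern_effect_no_scar = helping[("no scar", "intern")] - helping[("no scar", "no intern")]
intern_effect_scar = helping[("scar", "intern")] - helping[("scar", "no intern")]
interaction = intern_effect_scar - intern_effect_no_scar

print("Main effect of scar:", scar_effect)            # -26.0 percentage points
print("Main effect of intern:", intern_effect)        # -14.0 percentage points
print("Intern effect without scar:", intern_effect_no_scar)  # -4
print("Intern effect with scar:", intern_effect_scar)        # -24
print("Interaction contrast:", interaction)           # -20
```

If the two intern effects were equal, the interaction contrast would be zero and the bars in the figure would show parallel gaps.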
Consider a fictional example. Imagine we are interested in people’s perceptions of actors in different types of movies. We might predict that some actors are better suited to comedy and others are better suited to action movies. A simple experiment to test this hypothesis would show four movies in a 2 × 2 design, using the same two actors in two movies each (for a total of four conditions). The first IV would be the movie type, with two levels: action and comedy. The second IV would be the actor, with two levels: Will Smith and Arnold Schwarzenegger. The dependent variable would be the ratings of each movie on a 10-point scale. This design produces three possible results:
1. The main effect of actor: Do people generally prefer Will Smith or Arnold Schwarzenegger, regardless of the movie?
2. The main effect of movie type: Do people generally prefer action or comedy movies, regardless of the actor?
3. The interaction between actor and movie type: Do people prefer each actor in a different kind of movie? (i.e., are ratings affected by the combination of actor and movie type?)
After collecting data from a sample of participants, we end up with the average ratings for each movie shown in Table 5.3.
Table 5.3: Main effects and marginal means: the actor study
Remember that main effects represent the effects of one IV, averaging across the levels of the other IV. To average across levels, we calculate the marginal means, or the combined mean across levels of another factor. In other words, the marginal mean for action movies is calculated by averaging together the ratings of both Arnold Schwarzenegger and Will Smith in action movies. The marginal mean for Arnold Schwarzenegger is calculated by averaging together ratings of Arnold Schwarzenegger in both action and comedy movies. Performing these calculations for our 2 × 2 design results in four marginal means, which are presented alongside the participant ratings in Table 5.3. To verify these patterns would require statistical analyses, but it appears that people have a slight preference for comedy over action movies, as well as a slight preference for Arnold Schwarzenegger’s acting over Will Smith’s acting.
What about the interaction? The main hypothesis here posits that some actors perform better in some genres of movies (e.g., action or comedy) than they do in other genres, which suggests that the actor and the movie type have a combined effect on people’s ratings of the movies. Examining the means in Table 5.3 conveys a sense of this finding, but it is much easier to appreciate in a graph. Figure 5.7 shows the mean of participants’ ratings across the four conditions. If we focus first on the ratings of Arnold Schwarzenegger, we can see that participants did have a slight preference for him in action (6) versus comedy (5) roles. Then, examining ratings of Will Smith, we can see that participants had a strong preference for him in comedy (8) versus action (1.5) roles. Together, this set of means indicates an interaction between actor and movie type because the effects of one variable depend on another. In plain English: People’s perceptions of an actor depend on the type of movie in which he or she performs. This pattern of results fits nicely with the hypothesis that certain actors are better suited to certain types of movie: Arnold should probably stick to action movies, and Will should definitely stick to comedies.
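To make the marginal means and the interaction concrete, the sketch below recomputes them in Python from the four cell means quoted in the paragraph above (6, 5, 8, and 1.5). The function names are our own, and the output is purely descriptive; it does not replace the statistical test.

```python
# Mean rating in each of the four conditions, as quoted in the text.
ratings = {
    ("Arnold Schwarzenegger", "action"): 6.0,
    ("Arnold Schwarzenegger", "comedy"): 5.0,
    ("Will Smith", "action"): 1.5,
    ("Will Smith", "comedy"): 8.0,
}

def marginal_means(cells, factor_index):
    """Average the cell means that share a level of the chosen factor."""
    grouped = {}
    for condition, rating in cells.items():
        grouped.setdefault(condition[factor_index], []).append(rating)
    return {level: sum(values) / len(values) for level, values in grouped.items()}

actor_means = marginal_means(ratings, 0)   # basis for the main effect of actor
genre_means = marginal_means(ratings, 1)   # basis for the main effect of movie type

def comedy_advantage(actor):
    """How much higher an actor is rated in comedy than in action."""
    return ratings[(actor, "comedy")] - ratings[(actor, "action")]

# Interaction: the comedy advantage is very different for the two actors.
interaction = comedy_advantage("Will Smith") - comedy_advantage("Arnold Schwarzenegger")

print(actor_means)   # Arnold 5.5, Will 4.75
print(genre_means)   # action 3.75, comedy 6.5
print(interaction)   # 7.5
```

The marginal means match the verbal summary (comedy above action, Arnold above Will), while the large interaction contrast reflects the crossover visible in the graph.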
Figure 5.7: Interaction in the actor study
Before moving on to the logic of analyzing experiments, consider one more example from a published experiment. A large body of research in social psychology suggests that stereotypes can negatively affect performance on cognitive tasks (e.g., tests of math and verbal skills). According to Stanford social psychologist Claude Steele and his colleagues, individuals’ fear of confirming negative stereotypes about their group acts as a distraction. This distraction—which the researchers term stereotype threat—makes it hard to concentrate and perform well, and thus leads to lower scores on a cognitive test (Steele, 1997). One of the primary implications of this research is that ethnic differences in standardized-test scores can be viewed as a situational phenomenon—change the situation, and the differences go away. In the first published study of stereotype threat, Claude Steele and Josh Aronson (1995) found that when African-American students at Stanford were asked to indicate their race before taking a standardized test, this was enough to remind them of negative stereotypes, and they performed poorly. When the testing situation was changed, however, and participants were no longer asked their race, the students performed at the same level as Caucasian students. Worth emphasizing is that these were Stanford students and had therefore met admissions standards for one of the best universities in the nation. Even this group of elite students was susceptible to situational pressure but performed at their best when the pressure was eliminated.
In a great application of stereotype threat, social psychologist Jeff Stone at the University of Arizona asked both African-American and Caucasian college students to try their hands at putting on a golf course (Stone, Lynch, Sjomeling, & Darley, 1999). Putting was described as a test of natural athletic ability to half of the participants and as a test of sports intelligence to the other half. Thus, the experiment had two independent variables: the race of the participants (African-American or Caucasian) and the description of the task (“athletic ability” or “sports intelligence”). Note that “race” in this study is technically a quasi-independent variable because it is not manipulated. This design resulted in a total of four conditions, and the dependent variable was the number of putts that participants managed to make. Stone and colleagues hypothesized that describing the task as a test of athletic ability would lead Caucasian participants to worry about the stereotypes regarding their poor athletic ability. In contrast, describing the task as a test of intelligence would lead African-American participants to worry about the stereotypes regarding their lower intelligence.
Consistent with their hypotheses, Stone and colleagues found an interaction between race and task description but no main effects. That is, neither race was better at the putting task overall, and neither task description had an overall effect on putting performance. The combination of these variables, though, proved fascinating. When researchers described the task as measuring sports intelligence, the African-American participants did poorly due to fear of confirming negative stereotypes about their overall intelligence. Conversely, when researchers described the task as measuring natural athletic ability, the Caucasian participants did poorly due to fear of confirming negative stereotypes about their athleticism. This study beautifully illustrates an interaction; the effects of one variable (task description) depend on the level of another (race of participants). The results further confirm the power of the situation: Neither group did better or worse overall, but both were responsive to a situationally induced fear of confirming negative stereotypes.
Skill on the golf course was used to study stereotypes in an experiment conducted by Jeff Stone at the University of Arizona.
Figure 5.8: Comparing sources of variance
5.5 Analyzing Data From Experiments
So far, we have been drawing conclusions about experimental findings using conceptual terms. But naturally, before we actually make a decision about the status of our hypotheses, we have to conduct statistical analyses. This section provides a conceptual overview of the most common statistical techniques for analyzing experimental data.
Dealing With Multiple Groups
Why do researchers need a special technique for experimental designs? After all, we learned in Chapter 2 (2.4) that we can compare a pair of means using a t test; why not use several t tests to analyze our experimental designs? For the movie ratings study, we could analyze the data using a total of six t tests to capture every possible pair of means:
Arnold Schwarzenegger in a comedy versus Will Smith in a comedy; Arnold Schwarzenegger in an action movie versus Will Smith in an action movie; Arnold Schwarzenegger in a comedy versus an action movie; Will Smith in a comedy versus an action movie; Will Smith in a comedy versus Arnold Schwarzenegger in an action movie; and finally Will Smith in an action movie versus Arnold Schwarzenegger in a comedy.
This approach, however, presents a problem. The odds of making a Type I error (getting excited about a false positive) increase with every statistical test. Researchers typically set their alpha level at 0.05 for a t test, meaning that they are comfortable with a 5% chance of a Type I error. Unfortunately, if we conduct six t tests, each one has a 5% chance of a Type I error, meaning that we have a greater chance of a false-positive result somewhere in the study. In short, we need a statistical approach that reduces the number of comparisons we perform. Fortunately, a statistical technique called the analysis of variance (ANOVA) tests for differences by comparing the amount of variance explained by the independent variables to the variance explained by error.
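To see how quickly the risk of a false positive grows, the sketch below counts the pairwise comparisons among four conditions and estimates the chance of at least one Type I error across them. It relies on a simplifying assumption that is ours, not the chapter's: the tests are treated as independent, which is not strictly true when comparisons share groups, so the figure should be read as a rough guide.

```python
from math import comb

alpha = 0.05          # per-test Type I error rate
n_conditions = 4      # two actors x two movie types

n_tests = comb(n_conditions, 2)            # number of distinct pairs of means
familywise = 1 - (1 - alpha) ** n_tests    # P(at least one false positive), assuming independence

print(f"{n_tests} t tests -> familywise error rate of about {familywise:.3f}")
# 6 t tests -> familywise error rate of about 0.265
```

Under this approximation, running six tests at an alpha of 0.05 raises the overall chance of a false positive from 5% to roughly 26%.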
The Logic of ANOVA
The logic behind an analysis of variance is rather straightforward. As the course has discussed throughout, variability in a dataset can be divided into systematic and error variance. That is, we can attribute some of the variability to the factors being studied, but a degree of random error will always be present. In our movie ratings study, some of the variability in these ratings can be attributed to the independent variables (differences in actors and movie types), while some of the variability is due to other factors— perhaps some people simply like movies more than other people.
The ANOVA works by comparing the influence of these different sources of variance. We always want to explain as much of the variance as possible through the independent variables. If the independent variables have more influence than random error does, this is good news. If, on the other hand, error variance has more influence than the independent variables, this is bad news for the hypotheses. Comparing the three pie charts in Figure 5.8 conveys a sense of this problem. The proportion of variance explained by our independent variables is shaded in tan, while the proportion explained by error is shaded in red. In the top graph, the independent variables explain approximately 80% of the variance, which we can view as a good result. In the middle graph, however, variance is explained equally by the independent variables and by error, and in the bottom graph, the independent variables explain only 20% of the variance. Thus, in the latter two graphs, the independent variables do no better than random error at explaining the results.
One more analogy may be helpful. In the field of engineering, the term signal-to-noise ratio is used to describe the amount of light, sound, energy, etc., that is detectable above and beyond background noise. This ratio is high when the signal comes through clearly and low when it is mixed with static or other interference. Likewise, when someone tries to tune in a favorite radio station, the goal is to find a clear signal that is not covered up by static. Believe it or not, the ANOVA statistic (symbolized F) is doing the same thing. That is, the analysis tells us whether differences in experimental conditions (signal) are detectable above and beyond error variance (noise).
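The signal-to-noise idea can be illustrated with a hand computation of F for a simple one-way design. This is a minimal sketch using the standard one-way formula (variance between group means divided by variance within groups); the two tiny datasets are invented purely for illustration and do not come from any study in this chapter.

```python
from statistics import mean

def f_ratio(groups):
    """One-way ANOVA F: between-groups mean square over within-groups mean square."""
    all_scores = [score for group in groups for score in group]
    grand_mean = mean(all_scores)
    # Signal: how far each group mean sits from the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Noise: how much scores vary around their own group mean.
    ss_within = sum((score - mean(g)) ** 2 for g in groups for score in g)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Invented data: same nine scores overall, grouped in two different ways.
clear_signal = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]   # groups differ, little spread inside
mostly_noise = [[1, 5, 9], [2, 5, 8], [3, 5, 7]]   # identical group means, lots of spread

print(f_ratio(clear_signal))   # 27.0
print(f_ratio(mostly_noise))   # 0.0
```

A large F means the differences between conditions stand out against the background variability; an F near zero means any group differences are lost in the noise. Whether a given F is statistically significant depends on the degrees of freedom, which statistical software handles.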
Research: Thinking Critically
Love Ballad Leaves Women More Open to a Date
Follow the link below to a press release describing a 2010 study from the journal Psychology of Music. The study suggests that listening to love ballads may make women more likely to give their phone number to someone they have just met. As you read the article, consider what you have learned so far about the research process, and then respond to the questions below.
http://www.sciencedaily.com/releases/2010/06/100618112139.htm
Think About It
1. In this experiment, the type of song (love song or neutral song) is confounded with at least one other variable. Try to identify one. Do you think that this confounded variable would make a difference? How would you design a study that overcomes this?
2. Describe how demand characteristics might compromise the internal validity of this study. Can you think of any ways around this?
3. Toward the end of the article, the authors suggest that one explanation for these results is that the romantic music put the women into a more positive mood, and that this in turn made them more receptive to the men. How could you design a study that tests this hypothesis?
4. Given the nature of the DV in this study, would an ANOVA test be appropriate? What would be the more appropriate statistical test, and why?
Exploring the Data
Statistics courses cover ANOVA in more detail, but, despite its elegant simplicity, the test has a notable limitation. After conducting an ANOVA, we have a yes-or-no answer to the following question: Do our experimental groups have a systematic effect on the dependent variable? The answer lets us decide whether to reject the null hypothesis, but it does not tell us everything we want to know about the data. In essence, a significant F value tells us that the groups have a significant difference, but it does not tell us what the difference is. Conducting an ANOVA on our movie-ratings study would reveal a significant interaction between actor and movie, but we would need to take additional steps to determine the meaning of this interaction.
This section will describe the process of exploring and interpreting ANOVA results to make sense of the data. The example is drawn from a published study by Newman, Sellers, and Josephs (2005), which was designed to explore the effects of testosterone on cognitive performance. Previous research had suggested that testosterone was involved in two types of complex human behavior. On one hand, people with higher testosterone tend to perform better on tests of spatial skills, such as having to rotate objects mentally, and perform worse on tests of verbal skills, such as listing all the synonyms for a particular word. These patterns are thought to reflect the influence of testosterone on developing brain structures. On the other hand, people with higher testosterone are also more concerned with gaining and maintaining high status relative to other people. Testosterone correlates with a person’s position in the hierarchy and tends to rise and fall when people win and lose competitions, respectively. Sociologist Allan Mazur and his colleagues measured testosterone levels before, during, and after a series of professional chess matches. They found that testosterone rose in both players in anticipation of the competition, then rose even further in the winners, but plummeted in the losers (Mazur, Booth, & Dabbs, 1992).
Newman and colleagues (2005) set out to test the combination of these variables. Based on previous research, they hypothesized that people with higher testosterone would be uncomfortable when they were placed in a low-status position, leading them to perform worse on cognitive tasks. The researchers tested this hypothesis by randomly assigning people to a high status, low status, or control condition, and then administering a spatial and a verbal test. The resulting between-subjects design was a 2 (testosterone: high or low) × 3 (condition: high status, low status, control), for a total of six groups. Note that “testosterone” in this study is a quasi-independent variable, because it is measured rather than manipulated by the experimenters.
Once the results were in, the ANOVA revealed an interaction between testosterone and status but no main effects. Figure 5.9 shows the results of the study. These bars represent z scores that combine the spatial and verbal tests into one number. So, what do these numbers mean? How do we make sense out of the patterns? Doing so involves a combination of comparing means and calculating effect sizes, as we discuss next.
Figure 5.9: Exploring the data: Results from Newman et al. (2005)
Newman et al. (2005)
Mean Comparisons
The first step in interpreting results is to compare the various pairs of means within the design. This might seem counterintuitive, since the whole point of the ANOVA was to test for effects without comparing individual means. Our goal, therefore, is to somehow explore differences in conditions without inflating Type I error rates. Achieving this balance involves two strategies.
Planned comparisons (also called a priori comparisons) involve comparing only the means for which differences were predicted by the hypothesis. In the experiment by Newman et al. (2005), the hypothesis explicitly stated that high-testosterone people should perform better in a high-status position than in a low-status position. So, a planned comparison for this prediction would involve comparing two means with a t test: high T, high status (the highest red bar); and high T, low status (the lowest tan bar). Consistent with the researchers’ hypothesis, high-testosterone people scored higher on the combined tests in a high-status position than in a low-status position, t(27) = 2.35, p = 0.01. Type I errors are of less concern with planned comparisons because only a small number of theoretically driven comparisons are being conducted.
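A planned comparison of two means can be sketched in a few lines of Python. This is only an illustration of the arithmetic behind an independent-samples t test with a pooled variance; the function name `pooled_t` and the scores are hypothetical and are not the data from Newman et al. (2005).

```python
import math
import statistics


def pooled_t(group_a, group_b):
    """Return (t, df) for an independent-samples t test with pooled variance."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    # Weight each group's sample variance by its degrees of freedom.
    pooled_var = ((n_a - 1) * statistics.variance(group_a)
                  + (n_b - 1) * statistics.variance(group_b)) / df
    # Standard error of the difference between the two means.
    se = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t = (statistics.mean(group_a) - statistics.mean(group_b)) / se
    return t, df


# Hypothetical z scores for the two cells named in the prediction.
high_t_high_status = [0.9, 0.4, 0.7, 0.2, 0.8]
high_t_low_status = [-0.5, 0.1, -0.3, -0.6, 0.0]

t, df = pooled_t(high_t_high_status, high_t_low_status)
print(f"t({df}) = {t:.2f}")
```

The resulting t value would then be compared against a t distribution with the reported degrees of freedom to obtain a p value, a step that statistical software normally handles.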
Referring to the graph of these results in Figure 5.9 and comparing high- with low-testosterone people reveals another interesting pattern: In a high-status position, high-testosterone people do better than low-testosterone people, but in a low-status position, this pattern is reversed, and high-testosterone people do worse. However, the researchers did not predict these mean comparisons, so to do planned contrasts would be cheating. Instead, they would use a second strategy called a post hoc comparison, which controls the overall alpha by taking into account the fact that multiple comparisons are being performed. In most cases, researchers conduct post hoc tests only if the overall F test is significant.
One popular way to conduct post hoc tests while minimizing the error rate is to use a technique called a Bonferroni correction. This technique, named after the Italian mathematician who developed it, involves simply dividing the alpha level by the number of comparisons that are performed. For example, imagine we want to conduct 10 follow-up post hoc tests to explore the data. The Bonferroni correction would involve dividing the alpha level (0.05) by the number of comparisons (10), for a corrected alpha level of 0.005. Then, rather than using a cutoff of 0.05 for each test, we use this more conservative Bonferroni-corrected value of 0.005. Translation: Rather than accepting a Type I error rate of 5% on each test, we are moving to a more conservative 0.5% cutoff to correct for the number of comparisons that we are performing.
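The Bonferroni arithmetic described above can be sketched as follows. The function names and the list of p values are hypothetical, chosen only to mirror the 10-comparison example.

```python
def bonferroni_alpha(alpha, n_comparisons):
    """Return the per-test cutoff: alpha divided by the number of comparisons."""
    if n_comparisons < 1:
        raise ValueError("n_comparisons must be at least 1")
    return alpha / n_comparisons


def significant_after_bonferroni(p_values, alpha=0.05):
    """Flag each p value that falls below the Bonferroni-corrected cutoff."""
    cutoff = bonferroni_alpha(alpha, len(p_values))
    return [p < cutoff for p in p_values]


corrected = bonferroni_alpha(0.05, 10)

# Hypothetical p values from 10 follow-up comparisons.
p_values = [0.001, 0.004, 0.02, 0.03, 0.2, 0.5, 0.6, 0.7, 0.8, 0.9]
flags = significant_after_bonferroni(p_values)
print(f"corrected alpha = {corrected:.3f}; significant tests = {sum(flags)}")
```

Note that p values of 0.02 and 0.03 would pass an uncorrected 0.05 cutoff but do not pass the corrected one, which is the sense in which the correction is conservative.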
Another popular alternative to the Bonferroni correction is called Tukey’s HSD (for Honestly Significant Difference). This test works by calculating a critical value for mean comparisons (the HSD), and then using this critical value to evaluate whether mean comparisons are significantly different. The test manages to avoid inflating Type I error because the HSD is calculated based on the sample size, the number of experimental conditions, and the mean square within groups (MSWG), so a single critical value is applied to all the comparisons at once. In the study by Newman et al. (2005), both of these post hoc comparisons were significant: Compared to those low in testosterone, high-testosterone people did better in a high-status position but worse in a low-status position, suggesting that high testosterone magnifies the effect of testing situations on cognitive performance.
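As a rough sketch of how a Tukey critical value might be computed, the code below uses the common equal-group-size form, HSD = q × √(MSWG / n), where q comes from a studentized range table. The scores are hypothetical, and the value `Q_CRIT = 3.95` is an assumed table entry for 3 groups and 9 within-group degrees of freedom at alpha = 0.05; confirm it against a published table or statistical software before relying on it.

```python
import math
import statistics
from itertools import combinations


def ms_within(groups):
    """Mean square within groups: pooled within-group variance."""
    df_within = sum(len(g) for g in groups) - len(groups)
    ss_within = sum((len(g) - 1) * statistics.variance(g) for g in groups)
    return ss_within / df_within


def tukey_hsd(groups, q_crit):
    """Critical mean difference for equal-sized groups."""
    n = len(groups[0])
    if any(len(g) != n for g in groups):
        raise ValueError("this sketch assumes equal group sizes")
    return q_crit * math.sqrt(ms_within(groups) / n)


# Hypothetical scores for three conditions (not data from the study).
conditions = {
    "high status": [5.0, 6.0, 7.0, 6.0],
    "control": [4.0, 5.0, 4.0, 5.0],
    "low status": [2.0, 3.0, 2.0, 3.0],
}
Q_CRIT = 3.95  # assumed table value: 3 groups, 9 df, alpha = 0.05

hsd = tukey_hsd(list(conditions.values()), Q_CRIT)
means = {name: statistics.mean(scores) for name, scores in conditions.items()}
# A pair of means differs significantly if its difference exceeds the HSD.
results = {(a, b): abs(means[a] - means[b]) > hsd
           for a, b in combinations(conditions, 2)}
print(f"HSD = {hsd:.2f}", results)
```

Every pairwise difference is judged against the same HSD, which is how the procedure avoids running a separate uncorrected test for each pair.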
Effect Size
Statistical significance is only part of the story; researchers also want to know how big the effects of their independent variables are. Researchers can calculate effect size in several ways, but in general, bigger values mean a stronger effect. One of these statistics, Cohen’s d, is calculated as the difference between two means divided by their pooled standard deviation. The resulting values are therefore expressed in standard deviation units; a d of 1 means that the two means are one standard deviation apart. How big should we expect our effects to be? Based on his analyses of typical effect sizes in the social sciences, Cohen suggested the following benchmarks: d = 0.20 is a small effect; d = 0.50 is a moderate effect; and d = 0.80 is a large effect. In other words, a large effect in the social and behavioral sciences corresponds to a difference of about four-fifths of a standard deviation.
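The definition of Cohen’s d given above (difference between two means divided by the pooled standard deviation) can be sketched directly. The function name and the scores are hypothetical and serve only to show the calculation.

```python
import math
import statistics


def cohens_d(group_a, group_b):
    """Difference between two means divided by their pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * statistics.variance(group_a)
                  + (n_b - 1) * statistics.variance(group_b)) / (n_a + n_b - 2)
    return (statistics.mean(group_a) - statistics.mean(group_b)) / math.sqrt(pooled_var)


# Hypothetical test scores for two conditions.
condition_a = [12, 14, 16, 18, 20]
condition_b = [10, 12, 14, 16, 18]

d = cohens_d(condition_a, condition_b)
print(f"d = {d:.2f}")
```

Because d is scaled by the standard deviation, it can be compared across studies that used different measures, which a raw mean difference cannot.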
In interpreting the results of their testosterone experiment, Newman and colleagues (2005) computed effect-size measurements for two of the key mean comparisons. First, they compared high-testosterone people in the high- and low-status conditions; the size of this effect was d = 0.78. Second, they compared the high- and low-testosterone people in the low-status condition; the size of this effect was d = 0.61. Both of these effects are sizable by Cohen’s benchmarks. More important, taken together with the mean comparisons, they help us to understand the way testosterone affects behavior. The authors conclude that cognitive performance stems from an interaction between biology (testosterone) and environment (assigned status) such that high-testosterone people are more responsive to their status in a given situation. When they are placed in a high-status position, they relax and perform well. Conversely, when placed in a low-status position, they become distracted and perform poorly. Researchers reach this nuanced conclusion only through an exploration of the data, using mean comparisons and effect-size measures.
5.6 Wrap-Up: Avoiding Error
As this final chapter concludes, it is worth thinking back to one of the key concepts in Chapter 2 (2.4): Type I and Type II errors. Regardless of the research question, the hypothesis, or the particulars of the research design, all studies have the goal of making accurate decisions about the hypotheses. That is, we need to be able to correctly reject the null hypothesis when it is false, and fail to reject the null when it is true. Still, from time to time and despite our best efforts, we make mistakes when we draw conclusions about our hypotheses, as Table 5.4 summarizes. A Type I error, or “false positive,” involves falsely rejecting a null hypothesis and becoming excited about an effect that is due to chance. A Type II error, or “false negative,” involves failing to reject the null hypothesis and missing an effect that is real and interesting. (For a refresher on these terms, refer back to Chapter 2.)
Table 5.4: Review of Type I and Type II errors
                 Researcher’s Decision
                 Reject Null         Fail to Reject Null
Null is FALSE    Correct Decision    Type II Error
Null is TRUE     Type I Error        Correct Decision
This section takes a problem-solving approach to minimizing both of these errors in an experimental context. It turns out that each error is primarily under the researcher’s control at different stages in the research process, which means reducing each error calls for different strategies.
Avoiding Type I Error
Type I errors occur when results are due to chance but are mistakenly interpreted as significant. We can generally reduce the odds of this happening by setting our alpha level at p < 0.05, meaning that we will only be excited about results that have less than a 5% chance of Type I error. However, Type I errors can still occur as a result of either extremely large samples or large numbers of statistical comparisons. Large samples can make small effects seem highly significant, so it is important to set a more conservative alpha level in large-scale studies. And, as this chapter has discussed, the odds of Type I error are compounded with each statistical test we conduct.
What this means is that Type I error is primarily under researchers’ control during statistical analysis—the smarter the statistics, the lower the odds of Type I error. This chapter has discussed several examples of “smart” statistics: Instead of conducting lots of t tests, we use an ANOVA to test for differences across the entire design simultaneously. Instead of conducting t tests to compare means after an ANOVA, we use a mix of planned contrasts (for comparisons that we predicted) and post hoc tests (for other comparisons we want to explore). More advanced statistical techniques take this a step further. For example, the multivariate analysis of variance (MANOVA) statistic analyzes sets of dependent variables to reduce further the number of individual tests. Researchers use this approach when dependent variables represent different measures of a related concept, such as using heart rate, blood pressure, and muscle tension to capture the stress response. The MANOVA works, broadly speaking, by computing a weighted sum of these separate DVs (called a canonical variable) and using this new variable as the dependent variable. To learn more about this and other advanced statistical techniques, see the excellent volume by James Stevens (2002), Applied Multivariate Statistics.
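To see how quickly the odds of a Type I error compound across tests, consider the following sketch. It rests on a simplifying assumption that is not stated in the chapter: the tests are independent and each uses the same alpha, so the chance of at least one false positive is 1 − (1 − alpha)^m for m tests. The function name is hypothetical.

```python
def familywise_error_rate(alpha, n_tests):
    """Chance of at least one Type I error across independent tests."""
    return 1 - (1 - alpha) ** n_tests


# How the overall risk grows as more uncorrected tests are run at alpha = 0.05.
for m in (1, 5, 10, 20):
    print(f"{m:>2} tests: {familywise_error_rate(0.05, m):.3f}")
```

Under this assumption, 10 uncorrected tests carry roughly a 40% chance of at least one false positive, which is why a single omnibus test followed by controlled comparisons is preferred over many separate t tests.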
Avoiding Type II Error
Type II errors occur when a real underlying relationship exists between the variables, but the statistical tests are nonsignificant. The primary sources of this error are small samples and bad design. Small samples provide little statistical power, so a real effect may still produce a nonsignificant p value. Both large and small mistakes in experimental designs can add noise to the dataset, making it difficult to detect the real effects of independent variables.
This means that Type II error is primarily under the researcher’s control during the design process—the smarter the research designs, the lower the odds of Type II error. First, as Chapter 2 discussed, it is relatively simple to estimate the sample size needed for our research using a power calculator. These tools take basic information about the number of conditions in the research design and the estimated size of the effect and then estimate the number of people needed to detect this effect. (See Chapter 2, Figure 2.5, for an annotated example using one of these online calculators.)
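The logic of a power calculator can be sketched for the simplest case, a comparison of two groups. This is not the online calculator referred to above; it is a normal-approximation formula, n per group ≈ 2 × ((z for alpha + z for power) / d)², and the function name and defaults (two-tailed alpha of 0.05, power of 0.80) are assumptions made for illustration.

```python
import math
from statistics import NormalDist


def n_per_group(effect_size_d, alpha=0.05, power=0.80):
    """Approximate participants needed per group for a two-group comparison."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed cutoff
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) / effect_size_d) ** 2)


# Smaller expected effects require many more participants.
for d in (0.2, 0.5, 0.8):
    print(f"d = {d}: about {n_per_group(d)} per group")
```

Dedicated power software works from the t distribution rather than the normal approximation and will usually report slightly larger numbers, so treat these figures as a lower bound.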
Second, as every chapter has discussed, it is the experimenter’s responsibility to take steps to minimize extraneous variables that might interfere with the hypothesis test. Whether researchers are conducting an observation, a survey study, or an experiment, the overall goal is to ensure that the variables of interest are the main cause of changes in the dependent variable. This is perhaps easiest in an experimental context because these designs are usually conducted in a controlled setting where the experimenter has control over the independent variables. Nonetheless, as the chapter discussed earlier, many factors can threaten the internal validity of an experiment, from confounds to sample bias to expectancy effects. In essence, the more we can control the influence of these extraneous variables, the more confidence we can have in the results of the hypothesis test.
Table 5.5 presents a summary of the information in this section, listing the primary sources of Type I and Type II errors, as well as the time period when these are under experimenter control.
Table 5.5: Summary—avoiding error
Error      Definition        Main Source                        When You Can Control
Type I     False-positive    Lots of tests; lots of people      Conducting stats
Type II    False-negative    Bad measures; not enough people    Designing experiments
Summary and Resources
Chapter Summary
This chapter focused on experimental designs, in which the primary goal is to explain behavior in causal terms. The chapter began with an overview of experimental terminology and the key features of experiments. Three key features distinguish experiments from other research designs. First, researchers manipulate a variable, giving them a fair amount of confidence that the independent variable (IV) causes changes in the dependent variable (DV). Second, researchers control the environment, ensuring that everything about the experimental context is the same for different groups of participants, except for the level of the independent variable. Finally, the researchers have the power to assign participants to conditions using random assignment. This process helps to ensure that preexisting differences among participants (e.g., in mood, motivation, or intelligence) are balanced across the experimental conditions.
Next, the chapter explained the concept of experimental validity. When evaluating experiments, researchers must take into account both internal validity (the extent to which the IV is the cause of changes in the DV) and external validity (the extent to which the results generalize beyond the specific laboratory setting). Several factors can threaten internal validity, including experimental confounds, selection bias, and expectancy effects. The common thread among these threats is that they add noise to the hypothesis test and cast doubt on the direct connection between IV and DV. External validity involves two components, the realism of the study and the generalizability of the findings. Psychology experiments are designed to study real-world phenomena, but sometimes compromises have to be made to study these phenomena in the laboratory. Research often achieves this balance via mundane realism, or replicating the psychological conditions of the real phenomenon. Last, researchers have more confidence in the findings of a study when they can be replicated, or repeated in different settings with different measures.
In designing the nuts and bolts of experiments, researchers have to make decisions about both the nature and number of independent variables. First, designs can be described as between-subject, within-subject, or mixed. In a between-subject design, participants are in only one experimental condition and receive only one combination of the independent variables. In a within-subject design, participants are in all experimental conditions and receive all combinations of the independent variables. Finally, a mixed design contains a combination of between- and within-subject variables. In addition, research designs can be described as either one-way or factorial. One-way designs consist of only one IV with at least two levels; factorial designs consist of at least two IVs, each having at least two levels. A factorial design produces several results to examine: the main effect of each IV plus the interaction, or combination, of the IVs.
The chapter also discussed the logic of analyzing experimental data, using the analysis of variance (ANOVA) statistic. This test works by simultaneously comparing sources of variance and therefore avoids the risk of inflated Type I error. The ANOVA (or F) is calculated as a ratio of systematic variance to error variance, or, more specifically, of between-groups variance to within-groups variance. The bigger this ratio, the more experimental manipulations contribute to overall variability in scores. However, the F statistic suggests only that differences exist in the design; further analyses are necessary to explore these differences. The chapter described an example from a published study, discussing the process of comparing means and calculating effect sizes. In comparing means, researchers use a mix of planned contrasts (for comparisons that they predicted) and post hoc tests (for other comparisons they want to explore).
Finally, the chapter concluded by referring to two recurring concepts, Type I error (false positive) and Type II error (false negative). These errors interfere with the broad goal of making correct decisions about the status of a hypothesis. Thus, the purpose of this final section was to review ways to minimize errors. Type I errors are primarily inflated by large samples and lots of statistical analyses. Consequently, this error is under the experimenter’s control at the data-analysis stage. Type II errors are primarily inflated by small samples and flaws in the experimental design. Consequently, this error is under the experimenter’s control at the design and planning stage.
Key Terms
analysis of variance (ANOVA) A statistical procedure that tests for differences by comparing the variance explained by systematic factors to the variance explained by error.
between-subject design Experimental design in which each group of participants is exposed to only one level of the independent variable.
Bonferroni correction A post hoc test that involves adjusting the alpha level by the number of comparisons to set a more conservative cutoff.
carryover effect Effects of one level are present when another level is introduced, making it difficult to separate the effects of different levels.
conceptual replication Testing the relationship between conceptual variables using new operational definitions.
condition One of the versions of an independent variable, forming different groups in the experiment; in a factorial design, refers to the groups formed by combinations of IVs.
confounding variable (or confound) A variable that changes systematically with the independent variable.
constructive replication Recreation of the original experiment that adds elements to the design; usually designed to rule out alternative explanations or extend knowledge about the variables under study.
control condition Group within the experiment that does not receive the experimental treatment.
counterbalancing Variation of the order of presentation among participants to reduce order effects.
cover story A misleading statement to participants about what is being studied to prevent effects of demand characteristics.
demand characteristic Cue in the study that leads participants to guess the hypothesis.
differential attrition Loss of participants, who drop out of experimental groups for different reasons.
environmental manipulation Changing some aspect of the experimental setting.
exact replication Recreation of the original experiment as closely as possible to verify the findings.
experimental condition Group within the experiment that receives a treatment designed to test a hypothesis.
experimental design Design whose primary goal is to explain causes of behavior.
experimenter expectancy Researchers see what they expect to see, leading to subtle bias in favor of their hypotheses; threat to internal validity.
external validity A metric that assesses generalizability of results beyond the specific conditions of the experiment.
extraneous variable Variable that adds noise to a hypothesis test.
factorial design A design that has two or more independent variables, each with two or more levels.
factorial notation A system for describing the number of variables and the number of levels in experimental designs.
fatigue effect Decline of participants’ performance as a result of repeated testing.
generalizability The extent to which results extend to other studies, using a wide variety of populations and of operational definitions.
instructional manipulation Changing the way a task is described to change participants’ mind-sets.
interaction The combined effect of variables in a factorial design; the effects of one IV are different depending on the levels of the other IV.
internal validity A metric that assesses the degree to which results can be attributed to independent variables.
invasive manipulation Taking measures to change internal, physiological processes; usually conducted in medical settings.
level Another way to describe the versions of an independent variable; describes the specific circumstances created by manipulating a variable.
main effect The effect of each independent variable on the dependent variable, collapsing across the levels of other variables.
marginal mean The combined mean of one factor across levels of another factor.
matched random assignment A variation on random assignments; ensures that an important variable is equally distributed between or among the groups; the experimenter obtains scores on an important matching variable, ranks participants on this variable, and then randomly assigns participants to conditions.
mixed design Experimental design that contains at least one between-subject variable and at least one within-subject variable.
multivariate analysis of variance (MANOVA) A statistic that analyzes sets of dependent variables to reduce the number of individual tests.
mundane realism Research that replicates the psychological conditions of the real-world phenomenon; criterion for judging external validity.
one-way design A design that has only one independent variable, with two or more levels to the variable.
order effect Moderation of the effects because of the order in which levels occur.
participant replication Repetition of the study with a new population of participants; usually driven by a compelling theory as to why the two populations differ.
placebo control Group added to a study to reduce placebo effects; mimics the experimental condition in every way but one.
placebo effect Change resulting from the mere expectation that change will occur.
planned comparison (or a priori comparison) Comparisons that involve comparing only the means for which differences were predicted by the hypothesis.
post hoc comparison Comparison that controls the overall alpha by taking into account that multiple comparisons are being performed; usually allowed only if the overall F test is significant.
practice effect Improvement of participants’ performance as a result of repeated testing.
quasi-independent variable Preexisting difference used to divide participants in an experimental context; referred to as “quasi” because variables are being measured, not manipulated, by the experimenter.
random assignment A technique for assigning participants to conditions; before participants arrive, the experimenter makes a random decision for each participant’s placement in a group.
replication Repetition of research results in different contexts and/or different laboratories.
selection bias Occurs when groups are different before the manipulation; problematic because preexisting differences might be the driving factor behind the results.
Tukey’s HSD (Honestly Significant Difference) A post hoc test that calculates a critical value for mean comparisons (the HSD) and then uses this critical value to evaluate whether mean comparisons are significantly different.
unrelated-experiments technique A strategy for preventing the effects of demand characteristics, leading participants to believe that they are completing two experiments during one session; experimenter can use this to present the independent variable during the first experiment and measure the dependent variable during the second experiment.
within-subject design Experimental design in which each group of participants is exposed to all levels of the independent variable.
Apply Your Knowledge
1. List and briefly describe the three distinguishing features of an experiment.
a.
b.
c.
2. List the three types of expectancy effect that can affect experimental results, and name one way to avoid each type.
a.
b.
c.
3. The following designs are described using factorial notation. For each one, state (a) the number of variables in the design, (b) the number of levels each variable has, and (c) the total number of experimental conditions.
3 × 3 × 3
a.
b.
c.
2 × 3 × 4
a.
b.
c.
4 × 4
a.
b.
c.
2 × 2 × 2 × 2
a.
b.
c.
4. Forty students were asked to rate two authors according to their knowledge of certain topic areas. Each student was given two passages to read. In one passage (“Brain”), the author discussed the roles of various brain structures in perceptual-motor coordination. In the second passage (“Motivation”), the author described ways to enhance motivation in preschool children. For half the students, both passages were written by a male author. For the other half of the students, both passages were written by a female author. After reading the passages, students rated the authors’ knowledge of their topic areas on a scale ranging from 1 (displays very little knowledge) to 10 (displays a thorough knowledge).

              Male Author    Female Author
Brain         9              4
Motivation    6              7

Identify the following information about the design:
(1) Describe the design using factorial notation (e.g., 4 × 3).
(2) Identify the total number of conditions.
(3) Identify the design (circle one): between-subject within-subject mixed
5. For each of the following scenarios, identify what a Type I error and a Type II error would look like. Then, determine which type would be a bigger problem for that scenario.
a. A large international airport has received a bomb threat. In response, the airport police have tightened security and now check every piece of luggage manually. (1) Type I: (2) Type II: (3) Bigger problem:
b. Your friend purchases a pregnancy test.
(1) Type I: (2) Type II: (3) Bigger problem:
Critical Thinking Questions
1. Explain the advantages and disadvantages of a within-subject design.
2. Compare and contrast the following terms. Your answers should demonstrate that you understand each term. Be sure to give some kind of context (e.g., “both are types of . . .”) or provide an example, and state how they are different.
a. internal versus external validity
b. between-subject versus within-subject design
c. level versus condition
3. Explain the difference between Type I and Type II errors. How can each type of error be minimized?