Qualitative and Quantitative Methods


12/5/2017 Imprimir

https://content.ashford.edu/print/Newman.2681.16.1?sections=navpoint-13,navpoint-14,navpoint-15,navpoint-16,navpoint-17,navpoint-18&content=all… 1/35

Learning Outcomes

By the end of this chapter, you should be able to:

Outline the key features of descriptive, correlational, and experimental research designs.

Explain the importance of reliability and validity in designing research studies.

Compare and contrast the different scaling methods for measuring variables.

Identify the pros and cons of behavioral, physiological, and self-report measures.

Describe the process of framing and testing hypotheses.

2 Design, Measurement, and Testing Hypotheses

In the early 1950s, Canadian physician Hans Selye introduced the term stress into both the medical and popular lexicons. By that time, it had been accepted that humans have a well-evolved fight-or-flight response, which prepares them either to fight back or flee from danger, largely by releasing adrenaline and mobilizing the body’s resources more efficiently. While working at McGill University, Selye began to wonder about the health consequences of adrenaline and designed an experiment to test his ideas using rats. Selye injected rats with doses of adrenaline over a period of several days and then euthanized the rats in order to examine the physical effects of the injections. Just as he had hypothesized, rats that were exposed to adrenaline had developed ill effects, such as ulcers, increased arterial plaques, and decreases in the size of reproductive glands, all now understood to be consequences of long-term stress exposure. But there was just one problem. When Selye took a second group of rats and injected them with a placebo, they also developed ulcers, plaques, and shrunken reproductive glands.

Fortunately, Selye was able to solve this scientific mystery with a little self-reflection. Despite all his methodological savvy, he turned out to be rather clumsy when it came to handling rats, occasionally dropping one when he removed it from its cage for an injection. In essence, the experience for both groups of rats was one that we would now call stressful, and it is no surprise that they developed physical ailments in response. Rather than testing the effects of adrenaline injections, Selye was inadvertently testing the effects of being handled by a clumsy scientist. It is important to note that if Selye ran this study in the present day, ethical guidelines would dictate much more stringent oversight of his procedures to protect the welfare of the animals.

This story illustrates two key points about the scientific process. First, as Chapter 1 discussed, researchers should always be attentive to apparent mistakes because they can lead to valuable insights. Second, it is absolutely vital that researchers actually measure what they think they are measuring: Selye ended up measuring the effects of stress rather than just adrenaline injections. This chapter explains what it means to do research in a more concrete way, beginning with a broad look at the three types of research design. The goal at this stage is to obtain a general sense of what these designs are, when they are used, and the main differences between them. (Chapters 3, 4, and 5 are each dedicated to one type of research design and will elaborate on each one.) Following the overview of designs, this chapter covers a set of basic principles that are common to all research designs. Regardless of the particulars of a given design, all research studies involve making sure measurements are accurate and consistent and that they are captured using the appropriate type of scale. Finally, the chapter will discuss the general process of hypothesis testing, from laying out predictions to drawing conclusions.


2.1 Overview of Research Designs

As Chapter 1 explained, scientists can have a wide range of goals when they begin a research project, everything from describing a phenomenon to changing people’s behavior. It turns out that these goals will dictate different approaches to answering a research question. That is, researchers will approach the problem of describing voting patterns differently than they would approach the problem of how to increase voter turnout. These approaches are called research designs, or the specific methods that are used to collect, analyze, and interpret data. The choice of a design is not one to be made lightly; the way an investigator collects data trickles down to decisions about how to analyze the data and about the kinds of conclusions that can be drawn from the results. This section provides a brief introduction to the three main types of design: descriptive, correlational, and experimental.

Descriptive Research

Recall from Chapter 1 that a research study can have the basic goal of describing a phenomenon. If a research question centers around description, then the research design falls under the category of descriptive research, in which the primary goal is to describe thoughts, feelings, or behaviors. Descriptive research provides a static picture of what people are thinking, feeling, and doing at a given moment in time, as the following examples of research questions illustrate:

What percentage of doctors prefer Xanax for the treatment of anxiety? (thoughts)

What percentage of registered Republicans vote for independent candidates? (behaviors)

What percentage of Americans blame the president for the economic crisis? (thoughts)

What percentage of college students experience clinical depression? (feelings)

What is the difference in crime rates between Beverly Hills and Detroit? (behaviors)

What these five questions have in common is an attempt to get a broad understanding of a phenomenon without trying to delve into its causes.

The crime-rate example highlights the main advantages and disadvantages of descriptive designs. On the plus side, descriptive research is a good way to achieve a broad overview of a phenomenon and may inspire future research. It is also a good way to study things that are difficult to translate into a controlled experimental setting. For example, crime rates can affect every aspect of people’s lives, and this importance would likely be lost in an experiment that staged a mock crime in a laboratory. On the downside, descriptive research provides a static overview of a phenomenon and cannot explore the reasons for it. A descriptive design might tell us that Beverly Hills residents are half as likely as Detroit residents to be assault victims, but it would not reveal the underlying reasons for this discrepancy. (If we wanted to understand why this was true, we would use one of the other designs.)

Descriptive research can be either qualitative or quantitative; in fact, the large majority of qualitative research falls under the category of descriptive designs. Descriptions are quantitative when they attempt to make comparisons or to present a random sampling of people’s opinions. The majority of our example questions above would fall into this group because they quantify opinions from samples of households, or cities, or college students. Good examples of quantitative description appear in the “snapshot” feature on the front page of USA Today. The graphics represent poll results from various sources; the snapshot for May 15, 2015, reported that 90% of Americans crave more “variety” in their home-cooked meals (i.e., thoughts). View a current gallery of these snapshot graphs here: http://www.usatoday.com/services/snapshots/gallery/

Descriptive designs are qualitative when they attempt to provide a rich description of a particular set of circumstances. A powerful example of this approach can be found in the work of the late neurologist Oliver Sacks. Sacks wrote several books exploring the ways that people with neurological damage or deficits are able to navigate the world around them. In one selection from The Man Who Mistook His Wife for a Hat, Sacks (1998) relates the story of a man he calls William Thompson. As a result of chronic alcohol abuse, Thompson developed Korsakov’s syndrome, a brain disease marked by profound memory loss. The memory loss was so severe that Thompson had effectively “erased” himself and could remember only scattered fragments of his past.

Dr. Oliver Sacks studied how people with neurological damage formed and retained memories.

Whenever Thompson encountered people, he would frantically try to determine who he was. He would develop hypotheses and test them, as in this excerpt from one of Sacks’s visits:

I am a grocer, and you’re my customer, right? Well, will that be paper or plastic? No, wait, why are you wearing that white coat? You must be Hymie, the kosher butcher. Yep. That’s it. But why are there no bloodstains on your coat? (p. 112)

Sacks concluded that Thompson was “continually creating a world and self, to replace what was continually being forgotten and lost” (p. 113). With this story, Sacks helps illuminate Thompson’s experience and fosters readers’ gratitude for the ability to form and retain memories. This story also illustrates the trade-off in these sorts of descriptive case studies: Despite all its richness, we cannot generalize these details to other cases of brain damage; we would need to study and describe each patient individually.

Correlational Research

Recall from Chapter 1 that research studies can also have the goal of trying to predict a phenomenon. If a research question centers around prediction, then the research design falls under the category of correlational research, in which the primary goal is to understand the relationships among various thoughts, feelings, and behaviors. Examples of correlational research questions include:

Are people more aggressive on hot days?

Are people more likely to smoke when they are drinking?

Is income level associated with happiness?

What is the best predictor of success in college?

Does television viewing relate to hours of exercise?

What these questions have in common is the goal of predicting one variable based on another. If we know the temperature, can we predict aggression? If we know a person’s income, can we predict her level of happiness? If we know a student’s SAT scores, can we predict his college GPA?

These predictive relationships can turn out in one of three ways (Chapter 4 will provide more detail about each):

A positive correlation means that higher values of one variable predict higher values of the other variable. For instance, more money is associated with higher levels of happiness, and less money is associated with lower levels of happiness. The key is that these variables move up and down together, as the first row of Table 2.1 shows.

A negative correlation means that higher values of one variable predict lower values of the other variable. For example, more television viewing is associated with fewer hours of exercise, and fewer hours of television are associated with more hours of exercise. The key is that one variable increases while the other decreases, as the second row of Table 2.1 illustrates.

Finally, a third possibility is no correlation between two variables, meaning that we cannot predict one variable based on the other. In brief, changes in one variable are not associated with changes in the other, as seen in the third row of Table 2.1.
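The three outcomes can be sketched numerically with the Pearson correlation coefficient, a common index of linear association that ranges from -1 to +1. The sketch below is illustrative only: the helper `pearson_r` and every data value are invented for this example and do not come from a real study.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of deviations from each mean
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical measurements for five people (made up for illustration)
height_cm = [150, 160, 170, 180, 190]
shoe_size = [36, 39, 40, 41, 44]       # tends to rise with height
tv_hours = [1, 2, 3, 4, 5]
exercise_hours = [9, 8, 6, 5, 3]       # tends to fall as TV hours rise
siblings = [2, 1, 3, 1, 2]             # no pattern relative to shoe size

print("height vs. shoe size:  ", round(pearson_r(height_cm, shoe_size), 2))
print("TV vs. exercise:       ", round(pearson_r(tv_hours, exercise_hours), 2))
print("shoe size vs. siblings:", round(pearson_r(shoe_size, siblings), 2))
```

A value near +1 corresponds to the positive case, a value near -1 to the negative case, and a value near 0 to no correlation. With only five invented data points per variable, the numbers serve only to show the direction of each relationship.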

Table 2.1: Three possibilities for correlational research

Outcome | Description
Positive Correlation | Variables go up and down together. For example: Taller people have bigger feet, and shorter people have smaller feet.
Negative Correlation | One variable goes up, and the other goes down. For example: As a driver’s speed goes up, the time it takes to finish the trip decreases.
No Correlation | The variables have nothing to do with one another. For example: Shoe size and number of siblings are completely unrelated.

Figure 2.1: Correlation is not causation

Correlational designs are about testing predictions, but we are still unable to make causal, explanatory statements (that comes next). A common mantra in the field of psychology is that correlation does not equal causation. In other words, just because variable A predicts variable B does not mean that A causes B. This is true for two reasons, which we refer to as the directionality problem and the third variable problem. (See Figure 2.1.)

First, when we measure two variables at the same time, we have no way of knowing the direction of the relationship. Take the relationship between money and happiness: It could be true that money makes people happier, because they can afford nice things and fancy vacations. It could also be true that happy people have the confidence and charm to obtain higher-paying jobs, resulting in more money. In a correlational study, we are unable to distinguish between these possibilities. Or, take the relationship between television viewing and obesity: It could be that people who watch more television get heavier, because TV makes them snack more and exercise less. It could also be that people who are overweight lack the energy to move around and end up watching more television as a consequence. Once again, we cannot identify a cause–effect relationship in a correlational study.

Second, when we measure two variables as they naturally occur, a third variable that actually causes both of them is always a possibility. For example, imagine we find a correlation between the number of churches and the number of liquor stores in a city. Do people build more churches to offset the threat of liquor stores? Do people build more liquor stores to rebel against churches? Most likely, the link involves a third variable, population size, that causes changes in both variables: The more people who are living in a city, the more churches and liquor stores they can support. As another example, imagine a correlation between ice cream sales and homicide rates is discovered. Does ice cream lead people to commit murder? Do murderers like to buy ice cream on the way home from the scene of the crime? Most likely, the link involves a third variable, temperature, that causes changes in both variables: The hotter it gets outside, the more people want ice cream, and the greater the likelihood that disagreements will turn violent.
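The third variable problem can be illustrated with a small simulation. The sketch below assumes, purely for illustration, that temperature drives both ice cream sales and homicides and that the two outcomes have no direct link to each other. All coefficients, noise levels, and helper names (`pearson_r`, `residuals`) are invented for this example.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def residuals(ys, xs):
    """What is left of ys after removing its straight-line relation to xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return [(y - mean_y) - slope * (x - mean_x) for x, y in zip(xs, ys)]

rng = random.Random(0)  # fixed seed so the run is repeatable

# Assumed causal story: temperature influences BOTH outcomes;
# ice cream sales and homicides never influence each other.
temperature = [rng.uniform(0, 35) for _ in range(2000)]
ice_cream = [2.0 * t + rng.gauss(0, 10) for t in temperature]
homicides = [0.5 * t + rng.gauss(0, 5) for t in temperature]

r_raw = pearson_r(ice_cream, homicides)
r_controlled = pearson_r(residuals(ice_cream, temperature),
                         residuals(homicides, temperature))

print("ice cream vs. homicides:          ", round(r_raw, 2))
print("same, with temperature held fixed:", round(r_controlled, 2))
```

In this simulated world the raw correlation is clearly positive, yet it shrinks toward zero once temperature is statistically removed from both variables. The adjustment works here only because the third variable was measured; in a real correlational study, unmeasured third variables remain a possibility.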

Experimental Research


Finally, recall that research projects can have the goal of attempting to explain a phenomenon. When the research goal involves causal explanations, then research design falls under the category of experimental research, in which the primary goal is to explain thoughts, feelings, and behaviors and to make causal statements. Examples of experimental research questions include:

Does smoking cause cancer?

Does drinking alcohol make people more aggressive?

Does loneliness cause alcoholism?

Does stress cause heart disease?

Can meditation make people healthier?

Research: Making an Impact

Helping Behaviors

The 1964 murder of Kitty Genovese in plain sight of her neighbors, none of whom helped, drove numerous researchers to investigate why people may not help others in need. Are individuals selfish and bad, or does a group dynamic lead to inaction? Is there something wrong with our culture, or are situations more powerful than we think?

Among the body of research conducted in the late 1960s and 1970s was one pivotal study that revealed why people may not help others in emergencies. Darley and Latané (1968) conducted an experiment with various individuals in different rooms who communicated with each other via intercom. In reality, the study included just one participant and a number of confederates, one of whom pretended to have a seizure. Among participants who thought they were the only other person listening over the intercom, more than 80% helped, and they did so in less than 1 minute. However, among participants who thought they were one of a group of people listening over the intercom, less than 40% helped, and even then only after more than 2.5 minutes. This phenomenon (the more people who witness an emergency, the less likely any one of them is to help) has been dubbed the “bystander effect.” One of the main reasons that this tendency occurs is that responsibility for helping gets “diffused” among all of the people present, so that each one feels less personal responsibility for taking action.

Darley and Latané’s research can be seen in action and has influenced safety measures in today’s society. For example, when someone witnesses an emergency, it no longer suffices to simply yell to the group, “Call 911!” Because of the bystander effect, we know that most people will believe someone else will do it, and the call will not be made. Instead, it is necessary to designate a specific person to make the call. In fact, part of modern-day CPR training involves making individuals aware of the bystander effect and best practices for getting people to help and be accountable.

Although the bystander effect may be the rule, there are always exceptions. For example, on September 11, 2001, the fourth hijacked airplane was overtaken by a courageous group of passengers. Most people on the plane had heard about the twin tower crashes and recognized that their plane was heading for Washington, D.C. Despite being amongst nearly 100 other people, a few people chose to help the intended targets in D.C. Risking their own safety, this heroic group chose to help to prevent others from experiencing death and suffering. So, although we may see events that remind us of the reality of the bystander effect, we also see moments where people are willing to help, no matter the number of people that surround them.

Think About It:

1. What type of research design best describes Darley and Latané’s (1968) study?


2. What practical applications have resulted from research on people’s reluctance to help in emergencies?

What these five questions have in common is a focus on understanding why something happens. Experiments move beyond, for example, the question of whether alcoholics are more aggressive to whether alcohol actually causes an increase in aggression.

Experimental designs are able to address the shortcomings of correlational designs because the researcher has more control over the environment. Chapter 5 will cover this in great detail, but the basic process of conducting an experiment is relatively simple: A researcher has to control the environment as much as possible so that all participants in the study have the same experience. This helps eliminate other third variables that might influence the results. Researchers will then manipulate, or change, one key variable and then measure outcomes in another key variable. The variable manipulated by the experimenter is called the independent variable (IV). The outcome variable that is measured by the experimenter is called the dependent variable (DV). The combination of controlling the setting and changing one aspect of this setting at a time allows the experimenter to state with some certainty that the changes caused something to happen.

Think of this in a little more concrete way. Imagine that a researcher wanted to test the hypothesis that meditation improves health. In this case, meditation would be the independent variable, and health would be the dependent variable. One way to test this hypothesis would be to take a group of people and have half of them meditate 20 minutes per day for several days while the other half did something else for the same amount of time. The group that meditates would be called the experimental group because it provides the test of the hypothesis. The group that does not meditate would be called the control group because it provides a basis of comparison for the experimental group.

The researcher would want to make sure that these groups spent the 20 minutes in similar conditions so that the only difference would be the presence or absence of meditation. One way to accomplish this would be to have all participants sit quietly for the 20 minutes but give the experimental group specific instructions on how to meditate. Then, to test whether meditation led to increased health and happiness, the researcher would give both groups a set of outcome measures at the end of the study, perhaps a combination of survey measures and a doctor’s examination. If differences were found between the dependent measures for the two groups, the experimenter could be fairly confident that meditation caused them to happen. One way we can operationalize health outcomes in this study would be to measure blood pressure, as higher levels of blood pressure put people at risk for developing cardiovascular disease. So, for example, the researcher might find lower blood pressure in the experimental (meditation) group, which would suggest that meditation causes blood pressure to drop.
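The logic of the meditation experiment can be sketched in a few lines of code: participants are split into an experimental and a control group, the independent variable (meditation) is applied to one group only, and the dependent variable (blood pressure) is compared across groups. Everything below is hypothetical. The participant labels, the assumed 5-point drop in blood pressure, the use of a random shuffle to form the groups, and the helper names are illustration choices, not details from an actual study.

```python
import random

def random_assign(participants, rng):
    """Shuffle the participants and split them into two equal-sized groups."""
    shuffled = list(participants)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

def mean(values):
    return sum(values) / len(values)

rng = random.Random(42)  # fixed seed so the run is repeatable
participants = [f"P{i:02d}" for i in range(1, 41)]  # 40 hypothetical people
experimental, control = random_assign(participants, rng)

# ASSUMPTION for illustration: meditating lowers systolic pressure by 5 points.
ASSUMED_EFFECT = -5.0

def simulated_blood_pressure(meditated):
    """Dependent variable: a made-up reading around 120, plus any effect."""
    baseline = rng.gauss(120, 8)
    return baseline + (ASSUMED_EFFECT if meditated else 0.0)

bp_experimental = [simulated_blood_pressure(True) for _ in experimental]
bp_control = [simulated_blood_pressure(False) for _ in control]

difference = mean(bp_experimental) - mean(bp_control)
print("experimental group size:", len(experimental))
print("control group size:     ", len(control))
print("mean difference (experimental - control):", round(difference, 1))
```

Because chance alone decides who lands in each group, preexisting differences among participants should be spread roughly evenly across the two groups. With groups this small, the observed difference will still vary from run to run, which is one reason researchers rely on formal hypothesis tests before drawing conclusions.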

Choosing a Research Design

The choice of a research design is guided first and foremost by a researcher’s finding the best fit to the research question and then adjusting it depending on practical and ethical concerns. At this point, a nagging question may come to mind: If experiments are the most powerful type of design, why not use them every time? Why would anyone give up the chance to make causal statements? One reason is that we are often interested in variables that cannot be manipulated, for ethical or practical reasons, and that therefore have to be studied as they occur naturally. In one example, Matthias Mehl and Jamie Pennebaker (2003) happened to start a weeklong study of college students’ social lives on September 10, 2001. Following the terrorist attacks on the morning of September 11, Mehl and Pennebaker were able to track changes in people’s social connections and use this to understand how groups respond to traumatic events. Of course, it would have been unthinkable to manipulate a terrorist attack for this study experimentally, but since it occurred naturally, the researchers were able to conduct a correlational study of coping.

Another reason to use descriptive and correlational designs is that these are useful in the early stages of a research program. For example, before a psychologist can start to think about the causes of binge drinking among college students, it is important to understand how common this phenomenon is. Likewise, before a researcher designs a time- and cost-intensive experiment on the effects of meditation, it is a good idea to conduct a correlational study to test whether meditation even predicts health. In fact, this latter example comes from a series of real research studies conducted by psychiatrist Sara Lazar and her colleagues at Massachusetts General Hospital. This research team first discovered that experienced practitioners of mindfulness meditation had more development in brain areas associated with control over attention and emotion. But this study was only correlational; perhaps meditation caused changes in brain structure or perhaps people who were better at integrating emotions were drawn to meditation. In a follow-up study, researchers randomly assigned people either to meditate or to perform stretching exercises for two months. These experimental findings confirmed that mindfulness meditation actually caused structural changes to the brain (Hölzel et al., 2011). This series of studies is a prime example of how a research program can progress from correlational to experimental designs.

Table 2.2 summarizes the main advantages and disadvantages of these three types of design. In addition, the bottom of the table includes two examples of research topics—meditation and health, and temperature and aggression—to showcase the similarities and differences between the designs.

Table 2.2: Summary of research designs

Research Design | Descriptive | Correlational | Experimental
Goal | Describe characteristics of an existing phenomenon | Predict behavior; assess strength of relationship between variables | Explain behavior; assess impact of IV on DV
Advantages | Provides a complete picture of what is occurring at a given time | Allows testing of expected relationships; predictions can be made | Allows conclusions to be drawn about causal relationships
Disadvantages | Does not assess relationships; no explanation for phenomenon | Cannot draw inferences about causal relationships | Cannot manipulate many important variables
Example #1: Studying Meditation | What percentage of college students meditate at least once a week? | Are regular meditators happier and healthier? | If we randomly assign people to start meditating, do they become happier and healthier?
Example #2: Temperature and Aggression | How many violent crimes are committed in the summer? | Are crime rates higher in the summer than in the winter? | If we turn up the temperature in the laboratory, do people become more aggressive?

Designs on the Continuum of Control

Before leaving the design overview behind, we will consider how these designs relate to one another. The best way to think about the differences between the designs is in terms of the amount of control a researcher has. That is, experimental designs are the most powerful because the researcher controls everything from the hypothesis to the environment in which the data are collected. Correlational designs are less powerful because the researcher is restricted to measuring variables as they occur naturally. However, with correlational designs, the researcher does maintain control over several aspects of data collection, including the setting and the choice of measures. Descriptive designs are the least powerful because researchers have difficulty controlling outside influences on data collection. For example, when people answer opinion polls over the phone, they might be sitting quietly and pondering the questions or they might be watching television, eating dinner, and dealing with a fussy toddler. As a result, researchers are more limited as to the conclusions they can draw from these data. Figure 2.2 shows an overview of where research designs fall on the continuum of control in order of increasing control: from descriptive, to correlational (predictive), to experimental. Chapters 3, 4, and 5 will cover variations on these designs in more detail.


Figure 2.2: The continuum of control framework


2.2 Reliability and Validity

Each of the three types of research designs (descriptive, correlational, and experimental) has the same basic goal: to take a hypothesis about some phenomenon and translate it into measurable and testable terms. That is, whether researchers use a descriptive, correlational, or experimental design to test predictions about income and happiness, they still need to translate (or operationalize) the concepts of income and happiness into measures that will be meaningful for the study. Unfortunately, the sad truth is that research measurements will always be influenced by factors in addition to the conceptual variable of interest. Answers to any set of questions about happiness will depend both on actual levels of happiness and the ways people interpret the questions. The meditation experiment may have different effects, depending on people’s experience with meditation. Even describing the percentage of Republicans voting for independent candidates will vary according to characteristics of a particular candidate.

These additional sources of influence can be grouped into two categories: random and systematic errors. Random error involves chance fluctuations in measurements, such as when a participant misunderstands the question, or shows up in a terrible mood after walking through a blizzard to get to the study. Although random errors can influence measurement, they generally cancel out over the span of an entire sample. That is, some people may overreact to a question while others underreact. Heavy snowfall might put one person in a terrible mood and make another appreciate the joy of winter. While both of these examples would add error to a dataset, they would cancel each other out in a sufficiently large sample.

Systematic errors, in contrast, are those that systematically increase or decrease along with values of the measured variable. For example, people who have more experience with meditation may consistently show more improvement in a meditation experiment than those with less experience. Or, people who have higher self-esteem may score higher on a measure of happiness than those with lower self-esteem. In this case, the happiness scale does not do a good job of homing in on the concept of “happiness” and will end up instead assessing a combination of happiness and self-esteem. These types of errors can cause more serious trouble for a researcher’s hypothesis tests because they interfere with the attempts to understand the link between two variables.

In sum, the measured values of a variable reflect a combination of the true score, random error, and systematic error, as the following conceptual equation shows:

Measured Score = True Score + (Random Error + Systematic Error)

For example:

Happiness Score = Actual Happiness + (Misreading the Question + Self-Esteem)
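The conceptual equation can be turned into a small simulation that shows why the two kinds of error behave differently. The sketch below is an illustration under stated assumptions: the sample size, the error sizes, the 0.5 weight on self-esteem, and the choice to make self-esteem unrelated to actual happiness are all invented so that the contamination is easy to see.

```python
import random
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length lists."""
    mean_x, mean_y = mean(xs), mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

rng = random.Random(1)  # fixed seed so the run is repeatable
n = 10_000              # hypothetical sample size

# True scores and a second trait that, by assumption, is unrelated to them
actual_happiness = [rng.gauss(5, 1) for _ in range(n)]
self_esteem = [rng.gauss(0, 1) for _ in range(n)]

# Random error: chance fluctuations, as likely to be positive as negative
random_error = [rng.gauss(0, 1) for _ in range(n)]
# Systematic error: rises and falls with self-esteem (assumed weight of 0.5)
systematic_error = [0.5 * s for s in self_esteem]

# Measured Score = True Score + (Random Error + Systematic Error)
happiness_score = [t + r + s for t, r, s in
                   zip(actual_happiness, random_error, systematic_error)]

print("average random error:", round(mean(random_error), 3))
print("r(actual happiness, self-esteem):",
      round(pearson_r(actual_happiness, self_esteem), 2))
print("r(happiness score, self-esteem): ",
      round(pearson_r(happiness_score, self_esteem), 2))
```

In a run like this, the random errors average out to nearly zero across the sample, whereas the systematic error leaves the measured happiness score correlated with self-esteem even though actual happiness, by construction, is not. That spurious link is the kind of interference with hypothesis tests that the text describes.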

So, if our measurements are also affected by outside influences, how do we know whether our measures are meaningful? Occasionally, the answer to this question is straightforward; if we ask people to report their weight or their income level, these values can be verified using objective sources. Many research questions within psychology, however, involve more ambiguity. How do we know that our happiness scale is accurate? The problem is that we have no way to objectively verify happiness beyond people’s self-reports of their own happiness. What researchers need, then, are ways to assess how close they are to measuring happiness in a meaningful way. This assessment involves two related concepts: reliability, or the consistency of a measure; and validity, or the accuracy of a measure. This section examines both of these concepts in detail.

Reliability

The consistency of time measurement by watches, cell phones, and clocks reflects a high degree of reliability. People think of a watch as reliable when it keeps track of the time consistently: an hour should take the same amount of time to pass, 24 times per day. Likewise, a scale is reliable when it gives the same value for weight in back-to-back measurements: an individual’s weight should be the same if he steps off the scale and right back on, provided he stays away from the fridge in the meantime.

Reliability is defined as the extent to which a measured variable is free from random errors, and it is best understood as the degree of consistency in research measurements. As the chapter discussed previously, researchers’ measures are never perfect, and five main sources of random error threaten reliability:

1. Transient states, or temporary fluctuations in participants’ cognitive or mental state; for example, some participants may complete a study after an exhausting midterm or after a fight with their significant others.

2. Stable individual differences among participants; for example, some participants are habitually more motivated or happier than other participants.

3. Situational factors in the administration of the study; for example, an experiment conducted in the early morning may make everyone tired or grumpy.

4. Bad measures that add ambiguity or confusion to the measurement; for example, participants may respond differently to a question about “the kinds of drugs you are taking.” Some may take this to mean illegal drugs, whereas others interpret it as prescription or over-the-counter drugs.

5. Mistakes in coding responses during data entry; for example, a handwritten “7” could be mistaken for a “4.” (Happily, these types of errors have been minimized by the increasing role of computers in data collection. If someone clicks the number “7” in an online survey, the computer will record it as a “7” almost every time.)

Researchers naturally want to minimize the influence of all of these sources of error, and the text will touch on techniques for doing so throughout. However, researchers are also resigned to the fact that all measurements contain a degree of error. The goal, then, is to develop an estimate of how reliable measures are. Researchers generally estimate reliability in three ways.

1. Test–retest reliability refers to the consistency of the measure over time, much like the examples of a reliable watch and a reliable scale. A fair number of research questions in the social and behavioral sciences involve measuring stable qualities. For example, if someone were to design a measure of intelligence or personality, both of these characteristics should be relatively stable over time. An individual’s score on an intelligence test today should be roughly the same as the score when tested again in five years. A person’s level of extroversion today should correlate highly with his or her level of extroversion in 20 years. The test–retest reliability of these measures is quantified by simply correlating measures at two time points. The higher these correlations are, the higher the reliability will be. This makes conceptual sense as well; if measured scores reflect the true score more than they reflect random error, then this will result in increased stability of the measurements.

2. Inter-item reliability refers to the internal consistency among different items on a measure. Think back to the last time you completed a survey. Did it seem to ask the same questions more than once? (Chapter 4 [4.1] will discuss this technique.) The repetition is included because a single item is more likely to contain measurement error than the average of several items will—remember that small random errors tend to cancel out each other. Consider the following items from Sheldon Cohen’s Perceived Stress Scale (Cohen, Kamarck, & Mermelstein, 1983):

In the last month, how often have you felt that you were unable to control the important things in your life?

In the last month, how often have you felt confident about your ability to handle your personal problems?

In the last month, how often have you felt that things were going your way?

In the last month, how often have you felt difficulties were piling up so high that you could not overcome them?

Each of these items taps into the concept of feeling “stressed out,” or overwhelmed by the demands of life. One standard way to evaluate a measure like this is by computing Cronbach’s alpha, a statistic based on the average correlation between each pair of items and on the number of items in the scale. The more these items tap into a central, consistent construct, the higher the value of this statistic is. Conceptually, a higher alpha means that variation in responses to the different items reflects variation in the “true” variable being assessed by the scale items. Alpha levels typically range from zero to one, with higher numbers indicating more internal consistency. As a general rule, researchers want this index to be above 0.70 to have any confidence in the measure.

3. Interrater reliability refers to the consistency among judges observing participants’ behavior. The previous two forms of reliability were relevant in dealing with self-report scales; interrater reliability is more applicable when the research involves behavioral measures, which involve direct and systematic recording of observable behaviors. Imagine a researcher is studying whether alcohol consumption makes people behave more aggressively. One way to tackle this hypothesis would be to have a group of judges observe participants after drinking and rate their levels of aggression. In the same way that using multiple scale items helps to cancel out the small errors of individual items, using multiple judges cancels out the variations in each individual’s ratings. In this case, people could have slightly different ideas and thresholds for what constitutes aggression. To determine how much these differences matter, the researcher can evaluate the judges’ ratings by calculating the average correlation among the ratings. The higher the alpha values, the more the judges agree in their ratings of aggressive behavior. Conceptually, a higher alpha value means that variation in the judges’ ratings reflects real variation in levels of aggression.
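
The three reliability estimates described above can be sketched in a few lines. This is a minimal illustration, not a replacement for a statistics package: the helper names (`pearson_r`, `cronbach_alpha`, `mean_pairwise_r`) and all of the scores are hypothetical. The alpha function uses the common variance-based formula, which depends on the number of items as well as on how strongly they correlate, and it assumes any positively worded items have already been reverse-scored.

```python
from itertools import combinations
from statistics import fmean, variance


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mean_x, mean_y = fmean(x), fmean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (ss_x * ss_y) ** 0.5


def cronbach_alpha(items):
    """Cronbach's alpha; `items` holds one list of respondent scores per item."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]  # each respondent's total
    item_variance_sum = sum(variance(scores) for scores in items)
    return (k / (k - 1)) * (1 - item_variance_sum / variance(totals))


def mean_pairwise_r(raters):
    """Average correlation across every pair of raters."""
    return fmean(pearson_r(a, b) for a, b in combinations(raters, 2))


# Test-retest: the same five people measured at two time points (made-up data).
time_1 = [10, 12, 15, 18, 20]
time_2 = [11, 12, 16, 17, 21]
retest_r = pearson_r(time_1, time_2)

# Inter-item: four stress items answered by five respondents (made-up data).
stress_items = [
    [0, 1, 2, 3, 4],
    [1, 1, 2, 3, 4],
    [0, 2, 2, 3, 3],
    [0, 1, 3, 3, 4],
]
alpha = cronbach_alpha(stress_items)

# Interrater: three judges rating the aggression of five participants.
judges = [[1, 2, 3, 4, 5], [2, 2, 3, 5, 5], [1, 3, 3, 4, 4]]
judge_agreement = mean_pairwise_r(judges)

print(f"test-retest r: {retest_r:.2f}")
print(f"Cronbach's alpha: {alpha:.2f}")
print(f"mean interrater r: {judge_agreement:.2f}")
```

With these made-up numbers all three estimates come out high, so each would clear the 0.70 guideline mentioned above; real data are rarely this tidy.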

Validity

Recall the watch and scale examples. Perhaps some people set their watch 10 minutes ahead to avoid being late. Or perhaps certain individuals adjust their scale by 5 pounds to boost either their motivation or self-esteem. In these cases, the watch and the scale may produce consistent measurements, but the measurements are not accurate. It turns out that the reliability of a measure is a necessary but not sufficient basis for evaluating it. Put bluntly, measures can be (and have to be) consistent, but they might still be worthless. The additional piece of the puzzle is the validity of measures, or the extent to which they accurately measure what they are designed to measure.

Whereas reliability is threatened more by random error, validity is threatened more by systematic error. If the measured scores on the happiness scale reflect, say, self-esteem more than they reflect happiness, this would threaten the validity of the scale. The previous section explained that a test designed to measure intelligence ought to be consistent over time. And, in fact, these tests do show very high degrees of reliability. However, several researchers have cast serious doubts on the validity of intelligence testing, arguing that even scores on an official IQ test are influenced by a person’s cultural background, socioeconomic status (SES), and experience with the process of test-taking (for discussion of these critiques, see Daniels et al., 1997; Gould, 1996). For example, children growing up in higher SES households tend to have more books in the home, spend more time interacting with one or both parents, and attend schools that have more time and resources available, all of which correlate with scores on IQ tests. Thus, because all of these factors could increase scores on an intelligence test, they amount to systematic error in the measure of intelligence and, therefore, threaten the validity of a measured score on an intelligence test.

Researchers have two primary ways to discuss and evaluate the validity, or accuracy, of measures: construct validity and criterion validity.

Researchers evaluate construct validity based on how well the measures capture the underlying conceptual ideas (i.e., the constructs) in a study. These constructs are equivalent to the “true score” discussed in the previous section. That is, how accurately does the bathroom scale measure the construct of weight? How accurately does an IQ test measure the construct of intelligence relative to other things? The validity of measures can be assessed in a couple of ways. On the subjective end of the continuum, researchers can evaluate construct validity by assessing the face validity of the measure, or the extent to which it simply seems like a good measure of the construct. The items from the Perceived Stress Scale have high face validity because the items match what we intuitively mean by “stress” (e.g., “How often have you felt difficulties were piling up so high that you could not overcome them?”). However, if we were to measure an individual’s speed at eating hot dogs and then state it was a stress measure, the participant might be skeptical because hot-dog eating speed would lack face validity as a measure of stress.

Although face validity is nice to have, it can sometimes (ironically) reduce the validity of the measures. Imagine seeing the following two measures on a survey of attitudes:

1. Do you dislike people whose skin color is different from yours?

2. Do you ever beat your children?

On the one hand, these are extremely face-valid measures of attitudes about prejudice and corporal punishment; the questions very much capture our intuitive ideas about these concepts. On the other hand, even people who do support these attitudes may be unlikely to answer honestly because they recognize that neither attitude is popular. In cases like this, a measure low in face validity might end up being the more accurate approach. Chapter 4 will discuss ways to strike this balance.

Jupiterimages/Stockbyte/Thinkstock

Criterion validity can be used to predict a future behavioral outcome like management success.

On the less subjective end, researchers can evaluate construct validity by examining measures’ empirical connections to both related and unrelated constructs. Imagine for a moment that we are developing a new measure of liberal political attitudes. If we think about a person who describes herself as liberal, she is likely to support gun control, equal rights, and a woman’s right to choose. And she is less likely to be pro-war, anti-immigration, or anti-gay rights. Therefore, we would expect our new liberalism measure to correlate positively with existing measures of support for gun control, affirmative action, and abortion rights. This pattern of correlations taps into the metric of convergent validity, or the extent to which our measure overlaps with conceptually similar measures. But we would also want to ensure that the new measure captures something distinct from other constructs. In this case, we might want to demonstrate that we have developed a true measure of political attitudes, which does not simply correlate with religious beliefs. That is, we would want to show that liberal political views could be independent of religion. This hypothesized lack of correlations taps into the metric of discriminant validity, or the extent to which a measure diverges from unrelated measures.

To take another example, imagine someone wanted to develop a new measure of narcissism, usually defined as an intense desire to be liked and admired by other people. Narcissists tend to be self-absorbed but also very attuned to the feedback they receive from other people, especially feedback about the extent to which people admire them. Narcissism somewhat resembles self-esteem but differs enough; perhaps it is best viewed as high and unstable self-esteem. So, given these facts, a researcher might assess the discriminant validity of the measure by making sure it does not overlap too closely with measures of self-esteem or self-confidence. This approach would establish that the narcissism measure stands apart from these different constructs. The researcher might then assess the convergent validity of the measure by making sure that it does correlate with things like sensitivity to rejection and need for approval. These correlations would place the measure into a broader theoretical context and help to establish it as a valid measure of the construct of narcissism.
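
The convergent and discriminant checks in the narcissism example can be sketched as a pair of correlations. Everything here is an assumption made for illustration: the scores are invented, the helper `pearson_r` is written out by hand, and the cutoffs `CONVERGENT_MIN` and `DISCRIMINANT_MAX` are arbitrary placeholders, not published standards.

```python
from statistics import fmean


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mean_x, mean_y = fmean(x), fmean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (ss_x * ss_y) ** 0.5


# Invented scores for five respondents on three scales.
narcissism = [1, 2, 3, 4, 5]         # the new measure being validated
need_for_approval = [1, 3, 2, 5, 4]  # conceptually similar construct
self_esteem = [3, 4, 2, 3, 4]        # construct the measure should differ from

convergent_r = pearson_r(narcissism, need_for_approval)
discriminant_r = pearson_r(narcissism, self_esteem)

# Arbitrary illustrative cutoffs; a real study would justify its own.
CONVERGENT_MIN = 0.5
DISCRIMINANT_MAX = 0.3

print(f"convergent r: {convergent_r:.2f}")
print(f"discriminant r: {discriminant_r:.2f}")
print("pattern supports construct validity:",
      convergent_r >= CONVERGENT_MIN and abs(discriminant_r) <= DISCRIMINANT_MAX)
```

The point of the sketch is the pattern, not the numbers: a valid measure should correlate strongly with its conceptual neighbors and only weakly with constructs it is supposed to stand apart from.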

Criterion validity involves evaluating the validity of measures based on the association between measures and relevant behavioral outcomes. The “criterion” in this case refers to a measure that can be used to make decisions. For example, if someone developed a personality test to assess an individual’s management style, the most relevant metric of its validity is whether it predicts a person’s actual behavior as a manager. That is, we might expect people scoring high on this scale to be able to increase the productivity of their employees and to maintain a comfortable work environment. Likewise, if someone developed a measure that predicted the best careers for graduating seniors based on their skills and personalities, then criterion validity would be assessed using people’s actual success in these various careers. Whereas construct validity is more concerned with the underlying theory behind the constructs, criterion validity is more concerned with the practical application of measures. As might be expected, researchers are more likely to use this approach in applied settings.

That said, criterion validity is also a useful way to supplement validation of a new questionnaire. For example, a questionnaire about generosity should be able to predict people’s annual giving to charities, and a questionnaire about hostility ought to predict hostile behaviors. To supplement the construct validity of the narcissism measure, a researcher might examine its ability to predict the ways people respond to rejection and approval. Based on the definition of the construct, the researcher might hypothesize that narcissists would become hostile following rejection and perhaps become eager to please following approval. If these predictions were supported, it would mean further validation that the measure was capturing the construct of narcissism.

Criterion validity falls into one of two categories, depending on whether the researcher is interested in the present or the future. Predictive validity involves attempting to predict a future behavioral outcome based on the measure, as in the examples of the management-style and career-placement measures. Predictive validity is also at work when researchers (and colleges) try to predict students’ likelihood of academic success based on SAT or GRE scores. The goal here is to validate the construct via its ability to predict the future.

In contrast, concurrent validity involves attempting to link a self-report measure with a behavioral measure collected at the same time, as in the examples of the generosity and hostility questionnaires. The phrase “at the same time” is used loosely here; these self-report and behavioral measures may be separated by a short time span. In fact, concurrent validity sometimes involves trying to predict behaviors that occurred before completion of the scale, such as trying to predict students’ past drinking behaviors from an “attitudes toward alcohol” scale. The goal in this case is to validate the construct via its association with similar measures.

Comparing Reliability and Validity

This section has discussed how both reliability (consistency) and validity (accuracy) are ways to evaluate measured variables and to assess how well these measurements capture the underlying conceptual variable. In establishing estimates of both of these metrics, researchers essentially examine a set of correlations with their measured variables. But while reliability involves correlating variables with themselves (e.g., happiness scores at week 1 and week 4), validity involves correlating variables with other variables (e.g., happiness scale with the number of times a person smiles). Figure 2.3 displays the relationships among types of reliability and validity.

Figure 2.3: Types of reliability and validity

We learned earlier that reliability is necessary but not sufficient to evaluate measured variables. That is, reliability has to come first and is an essential requirement for any variable; no one would trust a watch that was sometimes five minutes fast and other times ten minutes slow. If we cannot establish that a measure is reliable, then there is really no chance of establishing its construct validity because every measurement might be a reflection of random error. However, just because a measure is consistent does not make it accurate. Someone’s watch might consistently be ten minutes fast; a scale might always be five pounds under the person’s actual weight. For that matter, a test of intelligence might result in consistent scores but actually be capturing respondents’ cultural background. Reliability tells us the extent to which a measure is free from random error. Validity takes the second step of telling us the extent to which the measure is also free from systematic error.
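
The two watches in this paragraph can be compared with a short sketch. The readings are invented to match the text (one watch alternating between five minutes fast and ten minutes slow, the other always ten minutes fast), and the helper name `summarize_errors` is hypothetical. The spread of the errors stands in for unreliability, and the average error stands in for systematic error.

```python
from statistics import fmean, pstdev


def summarize_errors(errors):
    """Return (average error, spread of errors) for repeated readings."""
    return fmean(errors), pstdev(errors)


# Errors in minutes: positive means the watch is fast, negative means slow.
erratic_watch = [5, -10, 5, -10, 5, -10]  # inconsistent from reading to reading
fast_watch = [10, 10, 10, 10, 10, 10]     # consistent, but always wrong

for name, errors in [("erratic", erratic_watch), ("fast", fast_watch)]:
    average_error, spread = summarize_errors(errors)
    print(f"{name} watch: average error {average_error:+.1f} min, "
          f"spread {spread:.1f} min")
```

The fast watch shows no spread at all, so it is perfectly consistent, yet its average error is still ten minutes. Consistency alone says nothing about accuracy.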

Finally, it is worth pointing out that establishing validity for a new measure is hard work. Reliability can be tested in a single step by correlating scores from multiple measures, multiple items, or multiple judges within a study. But testing the construct validity of a new measure involves demonstrating both convergent and discriminant validity. In developing our narcissism scale, we would need to show that it correlated with things like fear of rejection (convergent) but was reasonably different from things like self-esteem (discriminant). The latter criterion is particularly difficult to establish because it takes time and effort, and multiple studies, to demonstrate that one scale is distinct from another. However, an easy way exists to avoid these challenges: Use existing measures whenever possible. Before creating a brand-new happiness scale, or narcissism scale, or self-esteem scale, check the research literature to see if one exists that has already gone through the ordeal of being validated.

2.3 Scales and Types of Measurement

One of the easiest ways to decrease error variance and thereby increase reliability and validity is to make smart choices when designing and selecting measures. Throughout this book, we will discuss guidelines for each type of research design and ways to ensure that measures are as accurate and unbiased as possible. This section examines some basic rules that apply across all three types of design. We first review the four scales of measurement and discuss the proper use of each one; we then turn our attention to three types of measurement used in psychological research studies.

Scales of Measurement

Whenever researchers perform the process of translating conceptual variables into measurable variables (i.e., operationalization; see Chapter 1, section 1.2), they must ensure that their measurements accurately represent the underlying concepts. In Chapter 1, the discussion of validity explained that this accuracy is a critical piece of hypothesis testing. For example, if researchers develop a scale to measure job satisfaction, then they need to verify that this is actually what the scale is measuring.

However, measurement accuracy has an additional, subtler dimension: We also need to be sure that the numbers used in our chosen measurement accurately reflect the underlying mathematical properties of the concept. In many cases in the natural sciences, this process is automatically precise. When we measure the speed of a falling object or the temperature of a boiling object, the underlying concepts (speed and temperature) translate directly into scaled measurements. In the social and behavioral sciences, though, this process is trickier; researchers have to decide carefully how best to represent abstract concepts such as happiness, aggression, and political attitudes. As researchers take the step of scaling variables, or specifying the relationship between a conceptual variable and numbers on a quantitative measure, they have four different scales to choose from, presented below in order of increasing statistical power and flexibility.

Nominal Scales

Nominal scales are used to label or identify a particular group or characteristic. For example, we can label a person’s gender as male or female, and we can label a person’s religion as Catholic, Protestant, Buddhist, Jewish, Muslim, Hindu, etc. In experimental designs, researchers can also use nominal scales to label the condition to which a person has been assigned (e.g., experimental or control groups). The assumption in using these labels is that members of the group have some common value or characteristic, as defined by the label. For example, everyone in the Catholic group should have similar religious beliefs, and everyone in the female group should be of the same gender.

Research studies commonly represent these labels using numeric codes in a data file, such as 1 to indicate females and 2 to indicate males. However, these numbers are completely arbitrary and meaningless; that is, males do not have more gender than females. We could just as easily replace the 1 and the 2 with another pair of numbers or with a pair of letters or names. Thus, the primary limitation of nominal scales is that the scaling itself is arbitrary, which prevents us from using these values in mathematical calculations. One helpful way to appreciate the difference between this scale and the next three is to think of nominal scales as qualitative, because they label and identify, and to think of the other scales as quantitative, because they indicate the extent to which someone possesses a quality or characteristic. The next sections explore these quantitative scales in more detail.

Ordinal Scales

Researchers use ordinal scales to represent ranked orders of conceptual variables, such that higher numbers reflect increasing magnitude of the underlying variable. For example, beauty contestants, horses, and Olympic athletes are all ranked by the order in which they finish: first, second, third, and so on. Likewise, movies, restaurants, and consumer goods are often rated using a system of stars (i.e., 1 star is poor; 5 stars is excellent) to represent their quality. In these examples, we can draw conclusions about the relative speed, beauty, or deliciousness of the rating target. Even so, the numbers used to label these rankings do not necessarily map directly to differences in the conceptual variable. The fourth-place finisher in a race is rarely twice as slow as the second-place finisher; the beauty-contest winner is not three times as attractive as the third-place finisher; and the boost in quality between a four-star and a five-star restaurant is not the same as the boost between a two-star and three-star restaurant. Ordinal scales represent rank orders, but the numbers do not have any absolute value of their own. This type of scale, then, is more powerful than a nominal scale but still limited in that it does not allow performance of mathematical operations. For example, if an Olympic athlete finished first in the 800-meter dash, third in the 400-meter hurdles, and second in the 400-meter relay, we might be tempted to calculate her average finish as second place. Unfortunately, the properties of ordinal scales prevent us from doing this sort of calculation, because the conceptual distance between first, second, and third place would be different in each case. (That is, the runner might have won the 800-meter dash by 5 seconds but missed first place in the 400-meter relay by less than a second.) To perform any mathematical manipulation of variables requires one of the next two types of scale.

HasseChr/iStock Editorial/Thinkstock

An ordinal scale can place these three women in first, second, and third, but it cannot tell you how far apart they finished in their race.

Interval Scales

Interval scales represent cases where the numbers on a measured variable correspond to equal distances on a conceptual variable. For example, temperature increases on the Fahrenheit scale represent equal intervals: warming from 40 to 47 degrees is the same increase as warming from 90 to 97 degrees. Interval scales share the key feature of ordinal scales (higher numbers indicate higher relative levels of the variable), but interval scales go an important step further. Because these numbers represent equal intervals, we are able to add, subtract, and compute averages. That is, whereas we could not calculate the athlete’s average finish, we can calculate the average temperature in San Francisco or the average age of participants.

Ratio Scales

Ratio scales go one final step further, representing interval scales that also have a true zero point, that is, the potential for a complete absence of the conceptual variable. Physical measurements, such as length, weight, and time, represent ratio scales, because it is possible to have a complete absence of any of these. Most behavioral measures also represent ratio scales, as it is possible to have zero drinks per day, zero presses of a reward button, or zero symptoms of the flu. Temperature on the Kelvin scale is measured on a ratio scale because 0 kelvin, or absolute zero, indicates an absence of thermal molecular motion. (In contrast, 0 degrees Fahrenheit is merely an arbitrary point on the temperature scale.) Contrast these measurements with many of the conceptual variables featured in psychology research: there is no such thing as zero attitude toward gun control or zero self-esteem. The big advantage of having a true zero point is that it allows us to add, subtract, multiply, and divide scale values. When we measure weight, for example, it makes sense to say that a 300-pound adult weighs twice as much as a 150-pound adult. Likewise, it makes sense to say that having two drinks per day is only one-fourth as many as having eight drinks per day.

Choosing and Using Scales of Measurement

The take-home point from the discussion of these four scales of measurement is twofold. First, researchers should always use the most powerful and flexible scale possible for their conceptual variables. In many cases, no choice is possible; time is measured on a ratio scale and gender is measured on a nominal scale. But some cases permit researchers a bit more freedom in designing their study. For example, if someone were interested in correlating weight with happiness, the researcher could capture weight in a few different ways. One option would be to ask people their satisfaction with their current weight on a seven-point scale. However, the resulting data would be on an ordinal or interval scale (see discussion below), and the degree to which the researcher could manipulate the scale values would be limited. Another, more powerful option would be to measure people’s weight on a bathroom scale, resulting in ratio-scale data. Whenever possible, it is preferable to incorporate physical or behavioral measures. But it is also preferable (actually, required) to represent data accurately. Most variables in the social and behavioral sciences do not have a true zero point and must therefore be measured on nominal, ordinal, or interval scales.

Second, researchers should always be aware of the limitations of their measurement scale. As discussed above, these scales lend themselves to different amounts of mathematical manipulation. It is not possible to calculate statistical averages with anything less than an interval scale and not possible to multiply or divide anything less than a ratio scale. What does this mean for researchers? If they have collected ordinal data, they are limited to discussing the rank ordering of the values (e.g., the critics liked Restaurant A better than Restaurant B). If they have collected nominal data, they are limited to describing the different groups (e.g., percentages of Catholics and Protestants).
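
The limits described in this section can be summarized in a small lookup. This is a sketch only: the names `Scale`, `OPERATIONS_ADDED`, `permitted_operations`, and `can_average` are hypothetical, and the operation labels simply restate the rules given in the text (averages need at least an interval scale; multiplying and dividing need a ratio scale).

```python
from enum import IntEnum


class Scale(IntEnum):
    """The four scales of measurement, ordered from least to most flexible."""
    NOMINAL = 1
    ORDINAL = 2
    INTERVAL = 3
    RATIO = 4


# Operations each scale adds on top of the scales below it, per the text.
OPERATIONS_ADDED = {
    Scale.NOMINAL: {"label groups", "count group members"},
    Scale.ORDINAL: {"rank order"},
    Scale.INTERVAL: {"add", "subtract", "mean"},
    Scale.RATIO: {"multiply", "divide"},
}


def permitted_operations(scale):
    """Collect every operation allowed at this scale or any lower one."""
    allowed = set()
    for level in Scale:
        if level <= scale:
            allowed |= OPERATIONS_ADDED[level]
    return allowed


def can_average(scale):
    """True when computing a mean is meaningful for data on this scale."""
    return "mean" in permitted_operations(scale)


# Restaurant star ratings are ordinal; weight in pounds is ratio.
print("average star ratings?", can_average(Scale.ORDINAL))
print("average weights?", can_average(Scale.RATIO))
print("ratio-scale operations:", sorted(permitted_operations(Scale.RATIO)))
```

A check like this could sit in front of an analysis script so that, for example, a mean is never computed on a column that was coded as nominal or ordinal.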

One prominent grey area for both of these points is the use of attitude scales in the social and behavioral sciences. If we were to ask people to rate their attitudes about the death penalty on a seven-point rating scale, would the scale be ordinal or interval? This consideration turns out to be a contentious issue in the field. From the conservative point of view, these attitude ratings constitute only ordinal scales. We know that a 7 indicates more endorsement than a 3 but cannot say that moving from a 3 to a 4 is equivalent to moving from a 6 to a 7 in people’s minds. From the more liberal point of view, these attitude ratings can be viewed as interval scales. A researcher’s perspective is often driven by practical concerns: treating these as equal intervals allows us to compute totals and averages for our variables. Chapter 4 will return to this issue in discussing the creation of questionnaire items. For now, a good guideline is to assume that these individual attitude questions represent ordinal scales by default.

Types of Measurement

Each of the four scales of measurement can be used across a wide variety of research designs. In this section, we shift gears slightly and discuss measurement at a more conceptual, less mathematical level. The types of dependent measures used in psychological research studies can be grouped into three broad categories: behavioral, physiological, and self-report.

Behavioral Measurement

As mentioned earlier, behavioral measures are those that involve direct and systematic recording of observable behaviors. If a research question involves the ways that married couples deal with conflict, the researcher could include a behavioral measure by observing the way participants interact during an argument. Do they cut one another off? Listen attentively? Express hostility? Behaviors can be measured and quantified in one of four primary ways, as the scenario of observing married couples during conflict situations illustrates:

Frequency measurements involve counting the number of times a behavior occurs. For example, researchers could count the number of times each member of the couple rolled his or her eyes as a measure of dismissive behavior.

Duration measurements involve measuring the length of time a behavior lasts. For example, researchers could quantify the length of time the couple spends discussing positive versus negative topics as a measure of emotional tone.

Intensity measurements involve measuring the strength or potency of a behavior. For example, researchers could quantify the intensity of anger or happiness in each minute of the conflict using ratings by trained judges.

Latency measurements involve measuring the delay before onset of a behavior. For example, researchers could measure the time between one person’s provocative statement and the other person’s response.
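
The four ways of quantifying behavior can be sketched against a coded observation log. The log format, the `Observation` class, the behavior labels, and the 1 to 7 intensity rating are all assumptions made for illustration; real coding schemes vary from study to study.

```python
from dataclasses import dataclass
from statistics import fmean


@dataclass
class Observation:
    """One coded behavior; times are seconds from the start of the session."""
    behavior: str
    start: float
    end: float
    intensity: int = 0  # hypothetical 1-7 judge rating; 0 means not rated


def frequency(log, behavior):
    """Count how many times the behavior occurred."""
    return sum(1 for obs in log if obs.behavior == behavior)


def duration(log, behavior):
    """Total seconds spent on the behavior."""
    return sum(obs.end - obs.start for obs in log if obs.behavior == behavior)


def mean_intensity(log, behavior):
    """Average judge rating across rated occurrences of the behavior."""
    ratings = [obs.intensity for obs in log
               if obs.behavior == behavior and obs.intensity > 0]
    return fmean(ratings)


def latency(log, trigger, response):
    """Seconds from the end of the first trigger to the next response."""
    trigger_end = next(obs.end for obs in log if obs.behavior == trigger)
    response_start = next(obs.start for obs in log
                          if obs.behavior == response and obs.start >= trigger_end)
    return response_start - trigger_end


# Invented log for one couple's conflict discussion.
log = [
    Observation("eye_roll", 12.0, 13.0),
    Observation("provocation", 20.0, 24.0),
    Observation("angry_reply", 26.5, 40.0, intensity=5),
    Observation("eye_roll", 45.0, 46.0),
    Observation("angry_reply", 60.0, 70.0, intensity=3),
]

print("eye-roll frequency:", frequency(log, "eye_roll"))
print("angry-reply duration (s):", duration(log, "angry_reply"))
print("angry-reply mean intensity:", mean_intensity(log, "angry_reply"))
print("latency to reply (s):", latency(log, "provocation", "angry_reply"))
```

Keeping start and end times for every coded event is a design choice that lets all four measures be derived from a single log, rather than requiring separate coding passes.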

John Gottman, a psychologist at the University of Washington, has been conducting research along these lines for several decades, observing body language and interaction styles among married couples as they discuss an unresolved issue in their relationship (read more about this research and its implications for therapy on Dr. Gottman’s website, http://www.gottman.com/). What all of these behavioral measures provide is an unobtrusive way to measure the health of a relationship. That is, the major strength of behavioral responses is that they are typically more honest and unfiltered than responses to questionnaires. As Chapter 4 will discuss, people are sometimes dishonest on questionnaires to convey a more positive (or less negative) impression.

Behavioral responses offer a particular benefit for researchers interested in unpopular attitudes, such as prejudice and discrimination. If we were to ask people the extent to which they dislike members of other ethnic groups, they might not admit to these prejudices. Alternatively, a researcher could adopt the approach used by Yale psychologist Jack Dovidio and colleagues and measure how close people sat to people of different ethnic and racial groups, using this distance as a subtle and effective behavioral measure of prejudice (see http://www.yale.edu/intergroup/ for more information). But the primary downside to using behavioral measures may be evident: We end up having to infer the reasons that people behave as they do. Suppose that in one of these experiments, European-American participants, on average, sit farther away from African-Americans than from other European-Americans. This could, and often does, indicate prejudice; however, for the sake of argument, the farthest seat from the minority group member might also be the comfortable recliner with great lighting next to the window. To understand the reasons for behaviors, researchers have to supplement the behavioral measures with either physiological or self-report measurements.

Physiological Measurement

Physiological measures are those that involve quantifying bodily processes, including heart rate, brain activity, and facial muscle movements. If we were interested in the experience of test anxiety, we could measure heart rates as people complete a difficult math test. If we wanted to study emotional reactions to political speeches, we could measure heart rate, facial muscles, and brain activity as people view video clips. The big advantage of these measures is that they are the least subjective and the least open to conscious control. It is incredibly difficult for people to control their heart rate or brain activity consciously, making these a great tool for assessing emotional reactions. However, as with behavioral measures, we also need some way to contextualize physiological data.

The best example of this shortcoming is the use of the polygraph, or lie detector, to detect deception. The lie-detector test involves connecting a variety of sensors to the body to measure heart rate, blood pressure, breathing rate, and sweating. All of these are physiological markers of the body’s fight-or-flight stress response, and the test’s goal is to measure whether someone shows signs of stress while being questioned. But here is the problem: Being falsely accused is also stressful. A trained polygraph examiner must place all of the accused’s physiological responses in the proper context. Is the individual stressed throughout the exam or only stressed when asked whether he pilfered money from the cash box? Is the person stressed when asked about her relationship with her spouse because she killed him or because she was having an affair? The examiner has to be extremely careful to avoid false accusations based on misinterpretations of physiological responses. (For a recent commentary on the use of the polygraph in the courtroom, see http://www.thedailybeast.com/articles/2015/02/04/the-polygraph-has-been-lying-for-90-years.html.) The same cautions apply to using these measures in psychological research: Does heart rate increase because participants are stressed by a political message, or because the experiment is taking too long, and they are late to another appointment? The researcher should always include additional measures in the study to help sort out the reasons behind physiological change.

Self-Report Measurement

Self-report measures are those that involve asking people to report on their own thoughts, feelings, and behaviors. If we were interested in the relationship between income and happiness, we could simply ask people to report their income and their level of happiness. If we wanted to know whether people were satisfied in their romantic relationships, we could simply ask them to rate their degree of satisfaction. The major advantage of these measures is that they provide access to internal processes. That is, if we want insight into why people voted for their favorite political candidate, the only option is to ask them. However, as the text has suggested already, people may not necessarily be honest and forthright in their answers, especially when dealing with politically incorrect or unpopular attitudes. Chapter 4 will return to this tension and discuss ways to increase the likelihood of honest self-reported answers.

Digital Vision/Photodisc/Thinkstock

A self-report measure might be used to determine how likely voters are to support a candidate.

Two broad categories of self-report measures can be used. One of the most common approaches is to ask for people’s responses using a fixed-format scale, which asks them to indicate their opinion on a preexisting scale. For example, a researcher might ask people, “How likely are you to vote for the Republican candidate for president?” on a scale from 1 (not likely) to 7 (very likely). The other broad approach is to obtain responses using a free-response format, which asks people to express their opinion in an open-ended format. For example, researchers might ask people to explain, “What are the factors you consider in choosing a political candidate?” The trade-off between these two categories is essentially a choice between data that is easy to code and analyze and data that is rich and complex. In general, fixed-format scales are used more in quantitative research, while free-response formats are used more in qualitative research. Chapter 4 will discuss these categories further in a discussion of survey research.

Research: Thinking Critically

Neuroscience and Addictive Behaviors

Follow the link below to read an article by journalist Christian Nordqvist. In this article, Nordqvist reviews recent research suggesting that food addiction might involve brain mechanisms similar to those involved in drug addiction. As you read the article, consider what you have learned so far about the research process, and then respond to the questions below.

http://www.medicalnewstoday.com/articles/221233.php

Think About It:

1. Is the study described here descriptive, correlational, or experimental? Explain.

2. Can we conclude from this study that food addiction causes brain abnormalities? Why or why not?

3. The authors of the study concluded: “The current study also provides evidence that objectively measured biological differences are related to variations in YFAS (Yale Food Addiction Scale) scores, thus providing further support for the validity of the scale.” What type(s) of validity are they referring to? Explain.

4. What types of measures are included in this study (e.g., behavioral, self-report)? What are the strengths and limitations of these measures in this study?

Converging Operations: The Best of All Worlds

As these descriptions show, each type of measurement has its strengths and flaws. So, how do researchers decide which one to use? This question has to be answered for every case, and the answer involves consideration of three factors: fit with the research question; insights from previous research; and practical considerations like budget and equipment availability. However, in an ideal world, a program of research will use a wide variety of measures and designs. The term for this approach is converging operations, or the use of multiple research methods to solve a single problem. In essence, over the course of several studies (perhaps spanning several years) a researcher would address a research question using different designs, different measures, and different levels of analysis.

One good example of converging operations comes from the research of psychologist James Gross and his colleagues at Stanford University. Gross and his team study the ways that people regulate their emotional responses and have conducted this work using everything from questionnaires to brain scans (see http://spl.stanford.edu/projects.html).

One branch of Gross’s research has examined the consequences of trying to either suppress emotions (pretend they are not happening) or reappraise them (think of them in a different light). Gross’s team studies suppression by asking people to hold in their emotional reactions while watching a graphic medical video. The researchers study reappraisal by asking people to watch the same video while trying to view it as a medical student, thus changing the meaning of what they see. When people try to suppress their emotional responses, they experience an ironic increase in physiological and self-reported emotional responses, as well as deficits in cognitive and social functioning. When reappraising emotions, on the other hand, people experience lower levels of both reported and physiological emotion, without any loss of other functioning. In another branch of the research, Gross and colleagues have examined the neural processes at work when people change their perspective about an emotional event. In yet another branch of the research, they have used self-report measures to examine individual differences in emotional responses, with the goal of understanding why some people are more capable of managing their emotions than others. Taken together, these studies all converge into a more comprehensive picture of the process of emotion regulation than would be possible from any single study or method.

2.4 Hypothesis Testing

Regardless of the details of a particular study, be it correlational, experimental, or descriptive, all quantitative research follows the same process of testing a hypothesis. This section provides an overview of this process, including a discussion of the statistical logic, the five steps of the process, and the two ways we can make mistakes during our hypothesis test. Some of this material may be a review from statistics class, but it forms the basis of our scientific decision-making process and thus warrants repeating.

The Logic of Hypothesis Testing

Chapter 1 discussed several criteria for identifying a “good” theory, one of which is that theories have to be falsifiable. In other words, research questions should have the ability to be proven wrong under the right set of conditions. Why is this so important? This will sound counterintuitive at first, but by the standards of logic, when data run counter to a researcher’s theory, that is more meaningful than when data support the theory.

For example, suppose we hypothesize that growing up in a low-income family puts children at higher risk for depression. If the data fit this pattern, our prediction might very well be correct. It is also possible, however, that these results are due to a third variable: perhaps children in low-income families grow up in more stressful neighborhoods, and stress turns out to increase a person’s depression risk. Or, perhaps our sample accidentally contained an abnormal number of depressed people. This is why we are always cautious in interpreting positive results from a single study. Yet now, imagine that we test the same hypothesis and find that those who grew up in low-income families show a lower rate of depression. This is still a single study, but it suggests that our hypothesis may have been off-base.

Another way to think about this is from a statistical perspective. As the chapter discussed earlier, all measurements contain some amount of random error, which means that any pattern of data could be caused by random chance. This is the primary reason that research is never able to “prove” a theory. We will learn (or recall) from the study of statistics that at the end of any hypothesis test, we calculate a p value, representing the probability of observing our results, or results that are even more extreme, if random chance alone were at work. Conceptually, we are gauging how likely we are to be wrong in claiming an effect rather than calculating the probability that our prediction is right. And, all else being equal, the bigger the effect, the smaller this probability will generally be. So, as strange as it seems, the ideal result of hypothesis testing is to have a small probability of being wrong.
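One way to make the idea of a p value concrete is a permutation test, sketched below. This is only one of several ways to obtain a p value and is not necessarily the method used in any study discussed here; the function name, the ratings, and the number of shuffles are all invented for illustration. The logic is to shuffle the group labels many times and count how often chance alone produces a difference at least as large as the one observed.

```python
import random

def permutation_p_value(group_a, group_b, n_shuffles=5000, seed=0):
    """Share of random relabelings whose absolute mean difference is at
    least as large as the observed one (a two-sided permutation p value)."""
    rng = random.Random(seed)
    n_a = len(group_a)
    pooled = list(group_a) + list(group_b)

    def abs_mean_diff(values):
        first, second = values[:n_a], values[n_a:]
        return abs(sum(first) / len(first) - sum(second) / len(second))

    observed = abs_mean_diff(pooled)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # pretend group membership is arbitrary
        if abs_mean_diff(pooled) >= observed - 1e-12:  # tolerance for float rounding
            hits += 1
    return hits / n_shuffles

# Hypothetical 1-to-7 ratings: one large group difference, one tiny one.
p_large_effect = permutation_p_value([6, 7, 6, 5, 7, 6], [2, 3, 2, 1, 3, 2])
p_small_effect = permutation_p_value([4, 5, 3, 4, 5, 4], [4, 4, 5, 3, 4, 4])
print(p_large_effect, p_small_effect)
```

With these made-up numbers, the large difference yields a much smaller p value than the tiny one, which matches the point that bigger effects are harder to attribute to chance.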

This focus on falsifiability carries over to the way we test our hypotheses, in that the goal is to reject the possibility of results being due to chance. The starting point of a hypothesis test is to state a null hypothesis, or the assumption that the variables have no real effect in the overall population. This is another way of saying that observed patterns of data are due to random chance. In essence, we propose this null in hopes of minimizing the odds that it is true. Then, as a counterpoint to the null hypothesis, we propose an alternative hypothesis that represents the predicted pattern of results. This part is a little confusing, because the word alternative actually refers to the hypothesis in which we are interested. The term is employed because, in statistical jargon, the alternative hypothesis represents the predicted deviation from the null. These alternative hypotheses can be directional, meaning that we specify the direction of the effect, or nondirectional, meaning that we simply predict an effect.

Say we want to test the hypothesis that people like cats better than dogs. We would start with the null hypothesis, that people like cats and dogs the same amount (i.e., no difference). The next step is to state the alternative hypothesis (that is, our actual hypothesis), which in this case is that people will prefer cats. Because we are predicting a direction (cats more than dogs), this hypothesis is directional. The other option would be a nondirectional hypothesis, or simply stating that people’s cat preferences differ from their dog preferences. (Note that we have avoided predicting which one people like better, which is what makes it nondirectional.)

Finally, these three hypotheses can also be expressed using logical notation, as shown below. The letter H is used as an abbreviation for “hypothesis,” and the Greek letter µ is a common abbreviation for the mean, or average.

Conceptual Hypothesis: People like cats better than dogs.

Null Hypothesis: H0: µ_cat = µ_dog (the “cat” mean is equal to the “dog” mean; people like cats and dogs the same)

Nondirectional Alternative Hypothesis: H1: µ_cat ≠ µ_dog (the “cat” mean is not equal to the “dog” mean; people like cats and dogs different amounts)

Directional Alternative Hypothesis: H1: µ_cat > µ_dog (the “cat” mean is greater than the “dog” mean; people like cats more than dogs)

Why distinguish between directional and nondirectional hypotheses? A statistics class provides a more detailed answer, but it is important to note that this decision will have implications for the level of statistical significance. In essence, nondirectional hypotheses are less precise: “I think there is a difference,” versus “I believe cats are the preferred pet!” Because we always want to minimize the risk of coming to the wrong conclusion, we have to be more conservative with a nondirectional test. In this context, being conservative means needing a bigger group difference to feel confident in the results.

In the cats-versus-dogs example, a larger difference in ratings would be needed to support the claim that people like cats and dogs different amounts than would be needed to support the claim that people like cats more than dogs. The goal of all this statistical and logical jargon is to place hypothesis testing in the proper frame. The most important thing to remember is that hypothesis testing is designed to reject the null hypothesis, and statistical tests tell us how confident to be in this rejection.
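The sketch below illustrates why a nondirectional test demands a bigger difference. It assumes, purely for simplicity, that the test statistic is a z score following a standard normal distribution; the text does not commit to a particular test, and the z value of 1.80 and the alpha of 0.05 are invented for the example.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

def p_directional(z):
    """One-tailed p: chance of a z at least this large in the predicted direction."""
    return 1 - std_normal.cdf(z)

def p_nondirectional(z):
    """Two-tailed p: chance of a z at least this far from zero in either direction."""
    return 2 * (1 - std_normal.cdf(abs(z)))

alpha = 0.05
z = 1.80  # hypothetical standardized cat-minus-dog difference

print(p_directional(z) < alpha)     # True: enough for "cats more than dogs"
print(p_nondirectional(z) < alpha)  # False: not enough for "cats differ from dogs"

# Smallest z that reaches significance under each kind of hypothesis.
critical_directional = std_normal.inv_cdf(1 - alpha)
critical_nondirectional = std_normal.inv_cdf(1 - alpha / 2)
print(round(critical_directional, 3), round(critical_nondirectional, 3))  # 1.645 1.96
```

Because the nondirectional test splits its alpha across both tails, the same data produce a p value twice as large, so the bar for rejecting the null is higher.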

Five Steps to Hypothesis Testing

Now that we understand how to frame a hypothesis, what does a researcher do with this information? Framing a hypothesis is the first step of a five-step process of testing a hypothesis. This section walks through an example of hypothesis testing from start to finish, that is, from an initial hypothesis to a conclusion about the hypothesis. Using a fictitious study, we will test the prediction that married couples without children are happier than those with children in the home. This example is inspired by an actual study by Harvard social psychologist Dan Gilbert and his colleagues, described in a news article at http://www.telegraph.co.uk/news/1941195/Marriage-without-children-the-key-to-bliss.html. The hypothesis may seem counterintuitive, but Gilbert’s research suggests that people tend to both overestimate the extent to which children will make them happy and underestimate the added stress and financial demands of having children in the house.

Step 1: State the Hypothesis

The first step in testing this hypothesis is to spell it out in logical terms. Remember that we want to start with the null hypothesis that the presence of children in a home has no effect. So, in this case, the null hypothesis would be that couples are equally happy with and without children. Or, in logical notation, H0: µ_children = µ_no children (i.e., the mean happiness rating for couples with children equals the mean happiness rating for couples without children). From there, we can spell out our alternative hypothesis; in this case, we predict that having children will make couples less happy. Because this is a directional hypothesis, we write H1: µ_children < µ_no children (i.e., the mean happiness rating for couples with children is lower than the mean happiness rating for couples without children).

Step 2: Define Variables

Once we have an idea of the conceptual relationship that we want to test, we need to translate these concepts into measurable variables. As the chapter has discussed more than once, the decisions we make at this stage will trickle down and influence every subsequent step of the research process. For our current example, we will need to find a way to define the concept of “happiness,” as well as decide our criteria for “couples with / without children.” We have encountered happiness as an example before, so it seems fairly straightforward to define it based on participants’ responses to a happiness scale. But what does it mean for a couple to have children? Do the children need to be of a certain age, or would the study include everyone from parents of newborns to empty-nesters whose children are away at college? These types of decisions need to be made carefully, to ensure that we are controlling outside influences that might interfere with our hypothesis test. For example, couples who survive the trials and tribulations of raising a toddler without getting divorced may come to develop a more realistic set of expectations for their everyday happiness, compared to the parents of newborns or the parents of college students.

Step 3: Collect Data

The next step is to design and conduct a study that will test our hypothesis. The next three chapters will elaborate on this process in great detail, but the general idea is the same regardless of the design. In this case, the most appropriate design would be correlational because we want to predict happiness based on whether people have children. It would be impractical and unethical to randomly assign people to have children, so an experimental design is not possible in this case. One way to conduct our study would be to survey married couples about whether they had children and ask them to rate their current level of happiness with the marriage. Suppose we conduct this study and end up with the data in Figure 2.4.

Figure 2.4: Sample data for the “children and happiness” study

As the figure shows, the results suggest an average happiness rating of 5.7 for couples without children, compared to an average happiness rating of 2.0 for couples with children. These groups certainly look different (and encouraging for our hypothesis), but we need to be sure that the difference is big enough that we can reject the null hypothesis.

Step 4: Calculate Statistics

The next step in our hypothesis test is to calculate statistical tests to decide how confident we can be that the results are meaningful. Researchers have a wide variety of statistical tools at their disposal and different ways to analyze all manner of data. These tools can be broadly grouped into descriptive statistics, which describe the patterns and distribution of measured variables, and inferential statistics, which attempt to draw inferences about the population from which the sample was drawn. Researchers use inferential statistics to make decisions about the significance of the data. Statistics courses cover many of these in detail, and we will discuss a few examples throughout this book. All of these different techniques share a common principle: They attempt to make inferences by comparing the relationship among variables to the random variability of the data. As the chapter discussed earlier, people’s measured levels of everything from happiness to heart rate can be influenced by a wide range of variables. The hope in testing our hypotheses is that differences in our measurements will primarily reflect differences in the variables we are studying. In the current example, we would want to see that differences in happiness ratings of the married couples were influenced more by the presence of children than by random fluctuations in happiness. Regardless of which statistic a researcher chooses to test the hypothesis, the resulting value will be translated into a measure of statistical significance, and this provides a key piece of information for the final decision.
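The common principle described above, comparing a group difference to the random variability in the data, can be sketched with a t statistic for two independent groups. The text does not say which test fits this study, so the Welch form used here is an assumption, and the ratings are invented so that the group means match the 5.7 and 2.0 reported for Figure 2.4; they are not the actual data behind the figure.

```python
from statistics import fmean, variance

# Invented happiness ratings (1-7) whose means are 5.7 and 2.0.
no_children = [6, 5, 6, 7, 5, 6, 5, 6, 6, 5]
children = [2, 1, 3, 2, 2, 2, 3, 1, 2, 2]

def welch_t(group_a, group_b):
    """Mean difference ("signal") divided by its standard error ("noise")."""
    signal = fmean(group_a) - fmean(group_b)
    noise = (variance(group_a) / len(group_a)
             + variance(group_b) / len(group_b)) ** 0.5
    return signal / noise

t_statistic = welch_t(no_children, children)
print(f"mean difference = {fmean(no_children) - fmean(children):.1f}")
print(f"t = {t_statistic:.2f}")
```

With these made-up ratings the difference is roughly twelve times its standard error, the kind of ratio a statistics program would translate into a very small p value.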

Step 5: Make a Decision

Finally, we are able to draw a conclusion about our study. Based on the outcome of our statistical test (i.e., step 4), we will make one of two decisions about our null hypothesis:

Reject null: decide that the probability of the null being correct is sufficiently small; that is, results are due to differences in groups

or

Fail to reject null: decide that the probability of the null being correct is too big; that is, results are due to chance

Given the mean difference in Figure 2.4, and the small amount of error, our statistical test would certainly be significant, and we could be confident in rejecting the null hypothesis. At long last, we can express our findings in plain English: Couples with children are less happy than couples without children.

Having walked through this five-step process, we note an important fact. When it comes to analyzing data to test hypotheses, researchers actually rely on a computer program for part of this process, Step 4 in particular. Today, computing even a simple means comparison by hand is rare. Software programs such as SPSS, SAS, and Microsoft Excel can take a table of data, compute the mean difference, compare it to the variability, and calculate the probability that the results are due to chance. However, because these calculations happen behind the scenes, it is very important to understand the process. By understanding how the software operates, researchers can reach informed conclusions about their research questions. Otherwise, they risk making one of two possible errors in the hypothesis test, discussed in the next section.

Errors in Hypothesis Testing

In the children and happiness study, we concluded with a reasonable amount of confidence that our hypothesis was supported. Still, what if we made the wrong decision? Because our conclusions are based on interpreting probability, there is always a chance that we draw the wrong conclusion. In interpreting our hypothesis tests, we risk two potential errors, referred to as Type I and Type II errors.

Type I errors occur when the results are due to chance, but the researcher mistakenly concludes that the effect is significant. In other words, no effect of the variables exists in the population, but some quirk of the sample makes the effect appear significant. This error can be viewed as a false positive: researchers get excited over results that are not actually meaningful. In our children and happiness study, a Type I error would occur if children had no effect on happiness in the real world, but some quirk of chance made our “no children” group happier than the “children” group. For example, our sample of childless couples might accidentally contain a greater proportion of people with happy personalities or greater job stability or simply more marital satisfaction from the start.

Fortunately, although this error seems worrisome, we can generally compute the probability of making it. Our alpha level sets the bar for how extreme our data must be to reject the null hypothesis. At the end of the statistical calculation, a p value tells us how extreme the data actually are. When we set an alpha threshold of, say, 0.05, we are attempting to avoid a Type I error; our results will only be statistically significant if the effect outweighs the random variability by a big-enough amount. If the p value falls below our predetermined alpha level, we decide that the risk of a Type I error is sufficiently small and can therefore reject the null hypothesis. If, however, the p value is greater than (or even equal to) our alpha cutoff, we decide that the risk of Type I error is too high to ignore and will therefore fail to reject the null hypothesis.
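The decision rule and the meaning of alpha can be illustrated with a small simulation, sketched below under several assumptions: both groups are drawn from the same hypothetical normal population (so the null is true by construction), and the p value uses a normal approximation rather than any particular textbook test. All function names, sample sizes, and population values are invented for the example.

```python
import random
from statistics import NormalDist, fmean, variance

def decide(p_value, alpha=0.05):
    """Reject the null only when p is below alpha; a tie means fail to reject."""
    return "reject null" if p_value < alpha else "fail to reject null"

def two_tailed_p(group_a, group_b):
    """Approximate two-tailed p for a mean difference (normal approximation)."""
    standard_error = (variance(group_a) / len(group_a)
                      + variance(group_b) / len(group_b)) ** 0.5
    z = (fmean(group_a) - fmean(group_b)) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))

def type_i_error_rate(n_studies=2000, n_per_group=50, alpha=0.05, seed=1):
    """Simulate studies in which the null is TRUE and count false positives."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(n_studies):
        # Same population for both groups: any observed difference is chance.
        group_a = [rng.gauss(4.0, 1.5) for _ in range(n_per_group)]
        group_b = [rng.gauss(4.0, 1.5) for _ in range(n_per_group)]
        if decide(two_tailed_p(group_a, group_b), alpha) == "reject null":
            false_positives += 1
    return false_positives / n_studies

rate = type_i_error_rate()
print(f"Share of null-true studies that rejected the null: {rate:.3f}")
```

The share of false positives should land near the chosen alpha, which is the sense in which alpha sets the Type I error risk a researcher is willing to accept.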

Type II errors occur when a real effect exists, but the researcher mistakenly concludes that the results are due to chance. In other words, an effect of the variables does exist in the population, but some quirk of the sample makes the effect appear nonsignificant. This error can be viewed as a false negative: researchers miss results that actually could have been meaningful. In our children and happiness study, a Type II error would occur if couples without children really were happier than couples with children but some flaw in the study kept us from detecting the difference. For example, if our measures of happiness were poorly designed, people might vary in how they interpreted the items, and this source of error could make it difficult to spot an overall difference between the groups.

Although this error sounds disappointing, the good news is researchers have some fairly easy ways to avoid or minimize it. The key factor in reducing Type II error is to maximize the power of the statistical test, or the probability of detecting a real difference. In fact, power is inversely related to the probability of a Type II error: the higher the power, the lower the chance of Type II error. Power is analogous to the sensitivity, or accuracy, of the hypothesis test; it is under the researcher’s control in three main ways. First, as the section Reliability and Validity discussed, it is important to make sure that measures are capturing what the researcher thinks they are. If the happiness scale actually captures something like narcissism, then this will cause problems for the hypothesis about the predictors of happiness. Second, it is important to be careful throughout the process of coding and analyzing data. Small mistakes can occur at every step, from entering data, to calculating scale totals, to choosing an inappropriate analysis. And third, statistical tests generally have more power when the sample is larger. We will discuss each of these factors in more detail as we move through the course.
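The link between sample size, power, and Type II error can be sketched with a simplified calculation. The example assumes a one-tailed z test comparing two equal-sized groups with a known standard deviation, and a hypothetical true difference of half a standard deviation; none of these values come from the text, and real studies usually rely on dedicated power-analysis software.

```python
from statistics import NormalDist

def power_one_tailed(effect_size, n_per_group, alpha=0.05):
    """Probability of rejecting the null when the true standardized
    difference between two equal-sized groups is `effect_size`."""
    std_normal = NormalDist()
    z_critical = std_normal.inv_cdf(1 - alpha)             # bar set by alpha
    expected_z = effect_size * (n_per_group / 2) ** 0.5    # where z tends to land
    return 1 - std_normal.cdf(z_critical - expected_z)

for n in (10, 50, 100):
    power = power_one_tailed(0.5, n)
    print(f"n per group = {n:3d}  power = {power:.2f}  Type II risk = {1 - power:.2f}")
```

Under these assumptions, power climbs and the Type II risk falls as the groups get larger, which is the third of the three factors listed above.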

Research: Thinking Critically

The Truth About Cats and Dogs

Follow the link below to a press release on the website of the American Psychological Association. This press release describes a compelling research finding, from the social psychologist Allen McConnell, that examines the benefits of pet ownership for people’s mental health. As you read the article, consider what you have learned so far about the research process, and then respond to the questions below.

http://www.apa.org/news/press/releases/2011/07/cats-dogs.aspx

Think About It:

1. In the first study described, 217 people answered surveys about well-being, and the researchers compared responses of pet owners to those of nonowners.

a. Is this study descriptive, correlational, or experimental?

b. Can we infer a causal relationship from this study? Explain.

c. Is there a possible directionality problem or third variable problem? Explain.

2. In the third study, what is the independent variable? What is the dependent variable?

3. What are the null hypotheses being tested in each of these studies? What are the alternate hypotheses?

4. What would a Type I decision error be in these studies? A Type II decision error?

Summary of Correct and Incorrect Decisions

In the real world, at the level of the entire population, our null hypothesis is either true or false. That is, if we could test our hypothesis by surveying every married couple in the world, we could say with 100% certainty whether or not the hypothesis was true. However, in each individual study, at the level of our sample, we have to decide either to reject the null or fail to reject it. Table 2.3 summarizes the four possible outcomes of a decision about a hypothesis test. In the top left and bottom right cells, we make the right decision, either rejecting a null hypothesis that is false or failing to reject one that is true in the population. In the bottom left cell of the table, we make a Type I error, rejecting a null hypothesis that is actually true, and mistakenly thinking our hypothesis is supported (i.e., a false positive). In the top right cell of the table, we make a Type II error, failing to reject a null hypothesis that is actually false, and mistakenly thinking our hypothesis should be rejected (i.e., a false negative).

Table 2.3: Errors and correct decisions in hypothesis testing

                 Researcher’s Decision
                 Reject Null         Fail to Reject Null
Null is FALSE    Correct Decision    Type II Error
Null is TRUE     Type I Error        Correct Decision
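The four cells of Table 2.3 can also be read as a simple lookup. The sketch below is only an illustration; the function name classify_decision and its two arguments are hypothetical, not part of the text.

```python
def classify_decision(null_is_true: bool, reject_null: bool) -> str:
    """Return the Table 2.3 cell for one hypothesis-test decision."""
    if reject_null:
        # Rejecting a true null is a false positive (Type I error).
        return "Type I Error" if null_is_true else "Correct Decision"
    # Failing to reject a false null is a false negative (Type II error).
    return "Correct Decision" if null_is_true else "Type II Error"


# Walk through all four cells of the table, row by row.
for null_is_true in (False, True):
    for reject_null in (True, False):
        label = classify_decision(null_is_true, reject_null)
        print(f"null true={null_is_true}, reject={reject_null}: {label}")
```

In a real study we never observe whether the null is true, so this lookup describes the logic of the table rather than something a researcher can compute from data.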

Chapter 1 (section 1.3) explained the process of drawing conclusions about “proof” and “disproof,” suggesting that neither one is ever possible in a single study. Now that we have covered the hypothesis-testing process, the reasoning behind rules regarding proof and disproof should be clearer. In fact, Type I and Type II errors are possible in every research study. Rejecting the null hypothesis in one study does not automatically mean that it is false, only that the null hypothesis could not explain the pattern of data in the study. Moreover, failing to reject the null in one study does not automatically mean that it is true, only that the pattern of data in the study does not support rejecting it. Science accumulates knowledge over the course of several related studies. It is only when these studies start to suggest the same conclusion that we can feel more confident in our decisions about the status of the null hypothesis.
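One way to see why a single study cannot settle the matter is to simulate many studies in which the null hypothesis is true by construction. The sketch below is a simplified illustration under stated assumptions: scores are drawn from a normal population with a known standard deviation of 1, the test is a two-tailed z-test, and the function name, sample sizes, and seed are all invented for the example.

```python
import random
from statistics import NormalDist, fmean


def false_positive_rate(n_studies=2000, n_per_group=30, alpha=0.05, seed=1):
    """Simulate studies where the null is true and return the share that reject it."""
    rng = random.Random(seed)
    # Two-tailed critical value of the z statistic (about 1.96 for alpha = .05).
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Standard error of a difference between two means when each score has SD = 1.
    se = (2 / n_per_group) ** 0.5
    rejections = 0
    for _ in range(n_studies):
        # Both groups come from the same population, so any difference is chance.
        group_a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        group_b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        z = (fmean(group_a) - fmean(group_b)) / se
        if abs(z) > z_crit:
            rejections += 1  # rejecting a true null: a Type I error
    return rejections / n_studies


rate = false_positive_rate()
print(f"Share of studies rejecting a true null: {rate:.3f}")
```

The share of false positives should land near the alpha level, which is why one significant result, taken alone, is weak evidence and why converging results across studies matter.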

Effect Size

So far, our discussion about hypothesis testing has focused on statistical significance, and we have been concerned with the probability that our results might be due to random chance. However, keep in mind an additional piece of the puzzle of interpreting results. Imagine that someone has been placed in charge of testing a new drug that might help cure depression. The researcher might start by collecting a large sample of depressed patients and giving half of them the new drug and half of them a placebo. Now imagine that the new drug reduced symptoms by 20%, compared to a 10% reduction with the placebo. Is this effect big enough to get excited about? If the new drug costs twice as much as existing ones, is it worth recommending? These questions revolve around the issue of effect size, a statistic used to represent the size, or magnitude, of an effect.

[Photo: Effect size can be used to help determine the effectiveness of a particular drug. Credit: diego_cervo/iStock/Thinkstock]

Effect size may be calculated in several ways, but as a general rule, bigger values mean a stronger effect. One of these statistics, Cohen’s d, is calculated as the difference between two means divided by their pooled variability. In this case, our variability measure is something called the standard deviation, which represents the average deviation of individual scores from the mean of the group. A larger standard deviation indicates that the scores are dispersed more widely around the mean. When we use this number in calculating Cohen’s d, the resulting values can therefore be expressed in terms of standard deviations; a d of 1 indicates that the means are one standard deviation apart. How big should we expect our effects to be? Based on his analyses of typical effect sizes in the social sciences, Cohen suggests the following benchmarks: d = 0.20 is a small effect; d = 0.50 is a moderate effect; and d = 0.80 is a large effect. In other words, even a “large” effect in the social and behavioral sciences amounts to less than one standard deviation. For comparison purposes, the effect of the polio vaccine on reducing polio symptoms was a d = 2.72 (almost three standard deviations; Oshinsky, 2006). Our children and happiness study produces a d = 3.82, but fake data are always more impressive than real data.
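The definition of Cohen’s d can be written out as a short calculation. The sketch below assumes the common pooling formula, in which each group’s sample variance is weighted by its degrees of freedom; the text itself says only “pooled variability,” and the scores are invented for illustration.

```python
from statistics import mean, variance


def cohens_d(group_1, group_2):
    """Difference between two means divided by the pooled standard deviation."""
    n1, n2 = len(group_1), len(group_2)
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled_variance = (
        (n1 - 1) * variance(group_1) + (n2 - 1) * variance(group_2)
    ) / (n1 + n2 - 2)
    return (mean(group_1) - mean(group_2)) / pooled_variance ** 0.5


# Hypothetical symptom-reduction scores, invented for illustration only.
drug = [2, 4, 6]
placebo = [1, 3, 5]
print(cohens_d(drug, placebo))  # 0.5
```

Here the means differ by 1 point and the pooled standard deviation is 2 points, so the groups sit half a standard deviation apart.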

Effect size is useful in two primary ways. First, at the end of an experiment, we can calculate the exact size of the effect in our particular sample. This is a useful supplement to our test of statistical significance because it is less dependent on sample size. If we fail to reject the null hypothesis in a small sample, the effect size might tell us whether the effect is big enough to test again with a larger sample. And, if we support our research hypothesis, the effect size provides valuable information about the usefulness of our findings. Imagine testing two different diabetes drugs in two different studies. Say both show a statistically significant reduction in symptoms, but Drug A has an effect size of d = 0.50, and Drug B has an effect size of d = 2.5. This tells us that Drug B has a larger effect and could therefore offer diabetes patients a bigger benefit.

The second use for effect size is in deciding on our sample size before the study begins. We learned earlier that our statistical tests generally have more power in a larger sample size. So why not run 10,000 participants in every single research study? The problem is that participants take time, money, and other resources, and not every study needs 10,000 people to detect an effect. Rather than striving for perfect power in every study, researchers usually compromise and hope for 80% power, which equates to only a 20% chance of Type II error. It turns out that we also have more power when the underlying effect is larger. Thus, we can take our estimates of effect size and determine the number of people we need to achieve at least 80% power.

The best way to perform these calculations is by using any of the power calculators available over the Internet. Figure 2.5 presents an annotated example using the calculator available at http://www.stat.ubc.ca/~rollin/stats/ssize/n2.html. The values entered represent the means from our children and happiness study, plus the pooled standard deviation of 1.25. This calculation results in the previously mentioned d of 3.82. According to this calculator, we would only need two people per group to detect this effect in a future study, which is much cheaper and easier than 10,000.

Figure 2.5: Example of using effect size to estimate sample size
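For readers who want to see the arithmetic behind such a calculator, the sketch below uses a standard normal-approximation formula for comparing two group means with a two-tailed test. This is an assumption for illustration: the online calculator may use a different method, and the function name n_per_group is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate participants needed per group to detect an effect of size d."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed significance cutoff
    z_power = NormalDist().inv_cdf(power)          # cutoff for the desired power
    # Smaller effects shrink the denominator, so the required sample grows quickly.
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)


for d in (0.2, 0.5, 3.82):
    print(f"d = {d}: about {n_per_group(d)} people per group")
```

Under this approximation, the d of 3.82 from the children and happiness example requires two people per group, while a d of 0.5 requires 63 and a d of 0.2 requires 393.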

Summary and Resources

Chapter Summary

This chapter has discussed several basic principles of research design and emphasized the importance of ensuring that a study uses the best and most accurate measures available. We first examined the three main types of research design: descriptive, correlational, and experimental. Moving from descriptive to correlational to experimental, these designs give the researcher increasing amounts of control. Descriptive designs can provide rich descriptions of various phenomena, from brain tumors to voting preferences, but are unable to delve into why these things happen. Correlational designs allow researchers to predict variables from other variables but are still unable to identify a causal relationship. This limitation in correlational designs occurs for two reasons: We do not know the direction of the relationship, and it is always possible that a third variable is causing both of them. Finally, experimental designs allow researchers to state with some certainty that one variable causes another because these designs involve systematically testing the impact of variables while controlling the environment. The downside of experimental designs is that they often have to sacrifice some realism to establish control.

The chapter focused on the importance of the accuracy and consistency of measures. In every research study, researchers start with an abstract variable and operationalize it into a measured variable. “Happiness” becomes a seven-point scale; “time” becomes the reading on a stopwatch, and so on. A researcher’s job is to evaluate the extent to which these measured variables capture the underlying concepts. One metric for evaluating this is the reliability, or consistency of the measures. Measures are more reliable when they are free from random error; we can assess this level of reliability by comparing multiple measures within the study. A second metric is the validity, or accuracy of the measures. Measures are more valid when they are free from systematic error, meaning that they measure what they claim to measure. We can generally assess validity by examining correlations with other measures, either to test the theoretical construct or to predict a behavioral criterion.

The chapter next discussed the different options for scaling and measuring variables. In addition to ensuring the accuracy and consistency of measures, it is critical to use a scaling method that matches the mathematical properties of the variable. Nominal scales represent arbitrary labels for categories; ordinal scales represent rank ordering of values; interval scales represent scales with equal intervals; and ratio scales represent variables with true zero points. A researcher should use the most powerful scale available, for example by using behavioral counts rather than labels when possible. Nevertheless, researchers must also be aware of the limitations of the scale that they choose. While ratio scale values can be added, subtracted, divided, and multiplied, ordinal scale values cannot be manipulated in these ways. The chapter also identified three primary types of measurement. Behavioral measures involve observation and systematic recording of behavior; self-report measures involve asking people to report their own thoughts; and physiological measures involve measurements of bodily processes. Because each approach has advantages and disadvantages, many researchers use converging operations over the course of a research program, making use of all three to address a broad question.

Finally, this chapter discussed the process of hypothesis testing. Regardless of the question asked, the design used, and the way data are measured, all studies involve the same process of testing hypotheses using statistical results. The text explained the five steps of this process: (1) Lay out the null and alternative hypotheses, (2) define variables, (3) collect data, (4) calculate the appropriate statistics, and (5) make a decision about the original hypothesis. Despite our best efforts, a hypothesis test occasionally leads to incorrect conclusions. A Type I error occurs when the researcher rejects a null hypothesis that is actually true; a Type II error occurs when the researcher fails to reject a null hypothesis that is actually false and that could have been rejected under better conditions. As later chapters will discuss, we can reduce the odds of both errors through careful research design and analysis. The next three chapters will cover the specifics of the three types of research design: descriptive (Chapter 3), correlational (Chapter 4), and experimental (Chapter 5).

Key Terms

alpha level

alternative hypothesis

behavioral measure

Cohen’s d

concurrent validity

construct validity

continuum of control

control group

convergent validity

converging operations

correlational research

criterion validity

dependent variable

descriptive research

descriptive statistics

directional hypothesis

directionality problem

discriminant validity

duration

effect size

experimental group

experimental research

face validity

fixed-format

free-response

frequency

independent variable

inferential statistics

intensity

inter-item reliability

interrater reliability

interval scale

latency

negative correlation

nominal scale

nondirectional hypothesis

null hypothesis

ordinal scale

physiological measure

positive correlation

power

predictive validity

p value

random error

ratio scale

reliability

research design

scaling

self-report measure

standard deviation

systematic errors

test–retest reliability

third variable problem

Type I error

Type II error

validity

Apply Your Knowledge

1. For each of the following research questions, tell whether the most appropriate strategy involves descriptive, correlational, or experimental research.

a. Are students more likely to cheat on exams in their first or last year of college?

b. Does writing about a traumatic experience result in better health?

c. What personality variables predict success in school?

2. Dr. Blutarsky is interested in predicting the link between poor parenting and teen alcohol abuse. To investigate this, he has parents fill out questionnaires about their parenting styles and then waits to see how likely their children are to abuse alcohol.

a. Identify the independent and dependent variables in this study.

Independent:

Dependent:

b. What type of research design is Dr. Blutarsky using?

c. Give an operational definition of “poor parenting” and “alcohol abuse.”

Poor parenting:

Alcohol abuse:

3. For each of the following, identify the scale of measurement:

a. placing children in gifted and special-needs programs based on ability

b. an “attitudes toward the president” scale, measured from 1 to 7

c. height measured in inches

d. the number of drinks consumed per day

4. For each of the following abstract concepts, suggest a way to measure it using a behavioral measure and a self-report measure:

Concept                 Behavioral          Self-Report
Conformity
Enjoyment of reading
Leadership ability
Paranoia
Independence

Critical Thinking Questions

1. Can a measure be reliable but not valid? Explain why or why not.

2. Explain the trade-off between Type I and Type II errors. Why might attempts to minimize one of these inflate the other?
