2.3 Scales and Types of Measurement
One of the easiest ways to decrease error variance and thereby increase reliability and validity is to make smart choices when designing and selecting measures. Throughout this book, we will discuss guidelines for each type of research design and ways to ensure that measures are as accurate and unbiased as possible. This section examines some basic rules that apply across all three types of design. We first review the four scales of measurement and discuss the proper use of each one; we then turn our attention to three types of measurement used in psychological research studies.
Scales of Measurement
Whenever researchers perform the process of translating conceptual variables into measurable variables (i.e., operationalization; see Chapter 1, section 1.2), they must ensure that their measurements accurately represent the underlying concepts. In Chapter 1, the discussion of validity explained that this accuracy is a critical piece of hypothesis testing. For example, if researchers develop a scale to measure job satisfaction, then they need to verify that this is actually what the scale is measuring.
However, measurement accuracy has an additional, subtler dimension: We also need to be sure that the numbers used in our chosen measurement accurately reflect the underlying mathematical properties of the concept. In many cases in the natural sciences, this process is automatically precise. When we measure the speed of a falling object or the temperature of a boiling object, the underlying concepts (speed and temperature) translate directly into scaled measurements. In the social and behavioral sciences, though, this process is trickier; researchers have to decide carefully how best to represent abstract concepts such as happiness, aggression, and political attitudes. As researchers take the step of scaling variables, or specifying the relationship between a conceptual variable and numbers on a quantitative measure, they have four different scales to choose from, presented below in order of increasing statistical power and flexibility.
Nominal Scales
Nominal scales are used to label or identify a particular group or characteristic. For example, we can label a person’s gender as male or female, and we can label a person’s religion as Catholic, Protestant, Buddhist, Jewish, Muslim, Hindu, etc. In experimental designs, researchers can also use nominal scales to label the condition to which a person has been assigned (e.g., experimental or control groups). The assumption in using these labels is that members of the group have some common value or characteristic, as defined by the label. For example, everyone in the Catholic group should have similar religious beliefs, and everyone in the female group should be of the same gender.
Research studies commonly represent these labels using numeric codes in a data file, such as 1 to indicate females and 2 to indicate males. However, these numbers are completely arbitrary and meaningless—that is, males do not have more gender than females. We could just as easily replace the 1 and the 2 with another pair of numbers or with a pair of letters or names. Thus, the primary limitation of nominal scales is that the scaling itself is arbitrary, which prevents us from using these values in mathematical calculations. One helpful way to appreciate the difference between this scale and the next three is to think of nominal scales as qualitative, because they label and identify, and to think of the other scales as quantitative, because they indicate the extent to which someone possesses a quality or characteristic. The next sections explore these quantitative scales in more detail.
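The arbitrariness of nominal codes can be checked directly. The following Python sketch uses an invented sample of six participants (the data and the two coding schemes are illustrative assumptions, not taken from any study) and codes the same groups in two different ways. The group counts come out the same under both codings, whereas the "average code" changes whenever the labels change, which is why it carries no meaning.

```python
from collections import Counter
from statistics import mean

# Hypothetical sample: group membership for six participants.
groups = ["female", "male", "female", "female", "male", "female"]

# Two equally valid (and equally arbitrary) numeric codings.
coding_a = {"female": 1, "male": 2}
coding_b = {"female": 7, "male": 3}

codes_a = [coding_a[g] for g in groups]
codes_b = [coding_b[g] for g in groups]

# Counts per group depend only on the labels, not on the codes.
counts = Counter(groups)

# The "mean code" changes when the arbitrary codes change.
mean_a = mean(codes_a)
mean_b = mean(codes_b)

print(counts)
print(mean_a, mean_b)
```

Counting group members (and reporting percentages) is the one summary that survives any relabeling, which matches the limits on nominal data described above.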
Ordinal Scales
[Photo: An ordinal scale can place these three women in first, second, and third, but it cannot tell you how far apart they finished in their race. Credit: HasseChr/iStock Editorial/Thinkstock]
Researchers use ordinal scales to represent ranked orders of conceptual variables, such that higher numbers reflect increasing magnitude of the underlying variable. For example, beauty contestants, horses, and Olympic athletes are all ranked by the order in which they finish: first, second, third, and so on. Likewise, movies, restaurants, and consumer goods are often rated using a system of stars (e.g., 1 star is poor; 5 stars is excellent) to represent their quality. In these examples, we can draw conclusions about the relative speed, beauty, or deliciousness of the rating target. Even so, the numbers used to label these rankings do not necessarily map directly to differences in the conceptual variable. The fourth-place finisher in a race is rarely twice as slow as the second-place finisher; the beauty-contest winner is not three times as attractive as the third-place finisher; and the boost in quality between a four-star and a five-star restaurant is not necessarily the same as the boost between a two-star and three-star restaurant. Ordinal scales represent rank orders, but the numbers do not have any absolute value of their own.
This type of scale, then, is more powerful than a nominal scale but still limited in that it does not support arithmetic operations such as averaging. For example, if an Olympic athlete finished first in the 800-meter dash, third in the 400-meter hurdles, and second in the 400-meter relay, we might be tempted to calculate her average finish as second place. Unfortunately, the properties of ordinal scales prevent us from doing this sort of calculation, because the conceptual distance between first, second, and third place would be different in each case. (That is, the runner might have won the 800-meter dash by 5 seconds but lost the 400-meter relay by less than a second.) Performing arithmetic on a variable requires one of the next two types of scale.
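To see what a placing keeps and what it throws away, consider a small Python sketch. The runners, races, and finishing times are invented for illustration, and the helper names `placings` and `gap_to_next` are hypothetical. Both races produce exactly the same placings even though the gaps between finishers are very different.

```python
# Hypothetical finishing times in seconds for the same three runners
# in two different races (numbers invented for illustration).
race_1 = {"Ana": 118.0, "Bea": 123.0, "Cy": 123.4}
race_2 = {"Ana": 52.1, "Bea": 52.3, "Cy": 55.9}

def placings(times):
    """Return {runner: place}, where 1 is the fastest time."""
    ordered = sorted(times, key=times.get)
    return {runner: place for place, runner in enumerate(ordered, start=1)}

def gap_to_next(times):
    """Return the time gap between each finisher and the one behind."""
    ordered = sorted(times.values())
    return [round(later - earlier, 1) for earlier, later in zip(ordered, ordered[1:])]

# Same placings in both races...
print(placings(race_1), placings(race_2))
# ...but very different distances between first, second, and third.
print(gap_to_next(race_1), gap_to_next(race_2))
```

Once only the placings are recorded, the gaps cannot be recovered, so averaging placings treats unequal distances as if they were equal.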
Interval Scales
Interval scales represent cases where the numbers on a measured variable correspond to equal distances on a conceptual variable. For example, temperature increases on the Fahrenheit scale represent equal intervals—warming from 40 to 47 degrees is the same increase as warming from 90 to 97 degrees. Interval scales share the key feature of ordinal scales—higher numbers indicate higher relative levels of the variable—but interval scales go an important step further. Because these numbers represent equal intervals, we are able to add, subtract, and compute averages. That is, whereas we could not calculate the athlete’s average finish, we can calculate the average temperature in San Francisco or the average age of participants.
Ratio Scales
Ratio scales go one final step further, representing interval scales that also have a true zero point, that is, the potential for a complete absence of the conceptual variable. Physical measurements such as length, weight, and time represent ratio scales, because it is possible to have a complete absence of any of these. Most behavioral measures also represent ratio scales, as it is possible to have zero drinks per day, zero presses of a reward button, or zero symptoms of the flu. Temperature in kelvins is measured on a ratio scale because 0 K (absolute zero) is the lowest possible temperature, the point at which thermal motion is at its minimum. (In contrast, 0 degrees Fahrenheit is merely an arbitrary point on the temperature scale, not an absence of temperature.) Contrast these measurements with many of the conceptual variables featured in psychology research: no such things as zero attitude toward gun control or zero self-esteem exist. The big advantage of having a true zero point is that it allows us to multiply and divide scale values in addition to adding and subtracting them. When we measure weight, for example, it makes sense to say that a 300-pound adult weighs twice as much as a 150-pound adult. Likewise, it makes sense to say that having two drinks per day is only one-fourth as many as having eight drinks per day.
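The difference between interval and ratio scales can be made concrete with the temperature example. The Python sketch below uses the standard Fahrenheit-to-Kelvin conversion; the specific temperatures are chosen only for illustration. It shows that "80 is twice 40" holds for the Fahrenheit numbers but not for the temperatures they stand for, while equal Fahrenheit intervals remain equal after conversion.

```python
def fahrenheit_to_kelvin(deg_f):
    """Convert degrees Fahrenheit to kelvins."""
    return (deg_f - 32.0) * 5.0 / 9.0 + 273.15

cool, warm = 40.0, 80.0

# Ratio of the raw Fahrenheit numbers (not meaningful: no true zero).
ratio_f = warm / cool

# Ratio of the same two temperatures on the Kelvin scale (true zero).
ratio_k = fahrenheit_to_kelvin(warm) / fahrenheit_to_kelvin(cool)

# Equal Fahrenheit intervals are still equal Kelvin intervals.
diff_low = fahrenheit_to_kelvin(47.0) - fahrenheit_to_kelvin(40.0)
diff_high = fahrenheit_to_kelvin(97.0) - fahrenheit_to_kelvin(90.0)

print(ratio_f, round(ratio_k, 3))
print(round(diff_low, 3), round(diff_high, 3))
```

The Fahrenheit ratio is 2, but the Kelvin ratio is only about 1.08, so an 80-degree day is not "twice as hot" as a 40-degree day. The two differences match, which is why addition, subtraction, and averaging are safe on an interval scale.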
Choosing and Using Scales of Measurement
The take-home point from the discussion of these four scales of measurement is twofold. First, researchers should always use the most powerful and flexible scale possible for their conceptual variables. In many cases, no choice is possible; time is measured on a ratio scale and gender is measured on a nominal scale. But some cases permit researchers a bit more freedom in designing their study. For example, if someone were interested in correlating weight with happiness, the researcher could capture weight in a few different ways. One option would be to ask people to rate their satisfaction with their current weight on a seven-point scale. However, the resulting data would be on an ordinal or interval scale (see discussion below), and the degree to which the researcher could manipulate the scale values would be limited. Another, more powerful option would be to measure people’s weight on a bathroom scale, resulting in ratio-scale data. Whenever possible, it is preferable to incorporate physical or behavioral measures. But it is also preferable (in fact, required) to represent data accurately. Most variables in the social and behavioral sciences do not have a true zero point and must therefore be measured on nominal, ordinal, or interval scales.
Second, researchers should always be aware of the limitations of their measurement scale. As discussed above, these scales lend themselves to different amounts of mathematical manipulation. It is not possible to calculate statistical averages with anything less than an interval scale and not possible to multiply or divide anything less than a ratio scale. What does this mean for researchers? If they have collected ordinal data, they are limited to discussing the rank ordering of the values (e.g., the critics liked Restaurant A better than Restaurant B). If they have collected nominal data, they are limited to describing the different groups (e.g., percentages of Catholics and Protestants).
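The limits described above can be summarized as a simple lookup. The Python sketch below encodes only the rules stated in this section; the names `SCALES`, `ADDED_OPERATIONS`, `allowed_operations`, and `can_average` are hypothetical and chosen for illustration. Each scale inherits the operations of the scales below it and adds its own.

```python
# Scales in order of increasing flexibility, as presented in this section.
SCALES = ["nominal", "ordinal", "interval", "ratio"]

# The operations each scale adds on top of the scales below it.
ADDED_OPERATIONS = {
    "nominal": ["count group members"],
    "ordinal": ["rank order values"],
    "interval": ["add and subtract", "compute averages"],
    "ratio": ["multiply and divide"],
}

def allowed_operations(scale):
    """Return every operation supported by `scale`, including inherited ones."""
    position = SCALES.index(scale)
    allowed = []
    for lower_scale in SCALES[: position + 1]:
        allowed.extend(ADDED_OPERATIONS[lower_scale])
    return allowed

def can_average(scale):
    """Averages need at least an interval scale."""
    return "compute averages" in allowed_operations(scale)

print(allowed_operations("ordinal"))
print(can_average("ordinal"), can_average("interval"))
```

A check like `can_average` could be run before analysis as a reminder of which summaries a given variable supports.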
One prominent grey area for both of these points is the use of attitude scales in the social and behavioral sciences. If we were to ask people to rate their attitudes about the death penalty on a seven-point rating scale, would the scale be ordinal or interval? This consideration turns out to be a contentious issue in the field. From the conservative point of view, these attitude ratings constitute only ordinal scales. We know that a 7 indicates more endorsement than a 3 but cannot say that moving from a 3 to a 4 is equivalent to moving from a 6 to a 7 in people’s minds. From the more liberal point of view, these attitude ratings can be viewed as interval scales. A researcher’s perspective is often driven by practical concerns—treating these as equal intervals allows us to compute totals and averages for our variables. Chapter 4 will return to this issue in discussing the creation of questionnaire items. For now, a good guideline is to assume that these individual attitude questions represent ordinal scales by default.
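A small Python sketch can illustrate why the conservative view is cautious about averaging attitude ratings. The ratings and the "stretched" recoding below are invented for illustration; the recoding is just one hypothetical way the psychological distance between scale points might be unequal. Under the raw codes one group has the higher mean, under the order-preserving recoding the other does, and the medians give the same comparison either way.

```python
from statistics import mean, median

# Hypothetical seven-point attitude ratings from two groups.
group_a = [4, 4, 4, 4, 4]
group_b = [2, 2, 2, 6, 6]

# An order-preserving recoding with unequal steps: one way the
# psychological distance between scale points might be uneven.
stretched = {1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 10, 7: 11}

def recode(ratings):
    """Map each rating to its stretched value, keeping the rank order."""
    return [stretched[r] for r in ratings]

# Means: the group comparison flips under the recoding.
means_raw = (mean(group_a), mean(group_b))
means_stretched = (mean(recode(group_a)), mean(recode(group_b)))

# Medians: the comparison is the same either way.
medians_raw = (median(group_a), median(group_b))
medians_stretched = (median(recode(group_a)), median(recode(group_b)))

print(means_raw, means_stretched)
print(medians_raw, medians_stretched)
```

This does not settle the debate. It only shows that conclusions based on means depend on the equal-interval assumption, whereas conclusions based on rank order do not.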
Types of Measurement
Each of the four scales of measurement can be used across a wide variety of research designs. In this section, we shift gears slightly and discuss measurement at a more conceptual, less mathematical level. The types of dependent measures used in psychological research studies can be grouped into three broad categories: behavioral, physiological, and self-report.
Behavioral Measurement
As mentioned earlier, behavioral measures are those that involve direct and systematic recording of observable behaviors. If a research question involves the ways that married couples deal with conflict, the researcher could include a behavioral measure by observing the way participants interact during an argument. Do they cut one another off? Listen attentively? Express hostility? Behaviors can be measured and quantified in one of four primary ways, as the scenario of observing married couples during conflict situations illustrates:
· Frequency measurements involve counting the number of times a behavior occurs. For example, researchers could count the number of times each member of the couple rolled his or her eyes as a measure of dismissive behavior.
· Duration measurements involve measuring the length of time a behavior lasts. For example, researchers could quantify the length of time the couple spends discussing positive versus negative topics as a measure of emotional tone.
· Intensity measurements involve measuring the strength or potency of a behavior. For example, researchers could quantify the intensity of anger or happiness in each minute of the conflict using ratings by trained judges.
· Latency measurements involve measuring the delay before onset of a behavior. For example, researchers could measure the time between one person’s provocative statement and the other person’s response.
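The four kinds of behavioral measurement listed above can all be computed from the same coded observation record. The Python sketch below assumes a hypothetical log format: one entry per coded behavior, with start and end times in seconds and a judge's intensity rating from 1 to 5. The data, field names, and function names are invented for illustration. Averaging the intensity ratings also assumes the judges' ratings can be treated as an interval scale.

```python
# Hypothetical coded observation log for one couple. Times are seconds
# from the start of the session; intensity is a judge's 1-5 rating.
events = [
    {"actor": "A", "behavior": "provocation", "start": 10.0, "end": 14.0, "intensity": 3},
    {"actor": "B", "behavior": "response", "start": 16.5, "end": 25.0, "intensity": 4},
    {"actor": "B", "behavior": "eye_roll", "start": 30.0, "end": 31.0, "intensity": 2},
    {"actor": "B", "behavior": "eye_roll", "start": 52.0, "end": 53.5, "intensity": 4},
]

def frequency(log, behavior):
    """Number of times the behavior occurs."""
    return sum(1 for e in log if e["behavior"] == behavior)

def total_duration(log, behavior):
    """Total seconds spent in the behavior."""
    return sum(e["end"] - e["start"] for e in log if e["behavior"] == behavior)

def mean_intensity(log, behavior):
    """Average judged intensity across occurrences of the behavior."""
    ratings = [e["intensity"] for e in log if e["behavior"] == behavior]
    return sum(ratings) / len(ratings)

def latency(log, trigger, reaction):
    """Seconds from the end of the first trigger to the start of the next reaction."""
    trigger_end = next(e["end"] for e in log if e["behavior"] == trigger)
    reaction_start = next(
        e["start"] for e in log
        if e["behavior"] == reaction and e["start"] >= trigger_end
    )
    return reaction_start - trigger_end

print(frequency(events, "eye_roll"), total_duration(events, "eye_roll"))
print(mean_intensity(events, "eye_roll"), latency(events, "provocation", "response"))
```

Frequency, duration, and latency here are counts or times with a true zero, so they fall on ratio scales; the judged intensity ratings are subject to the ordinal-versus-interval question discussed earlier.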
John Gottman, a psychologist at the University of Washington, has been conducting research along these lines for several decades, observing body language and interaction styles among married couples as they discuss an unresolved issue in their relationship (read more about this research and its implications for therapy on Dr. Gottman’s website, http://www.gottman.com/ ). What all of these behavioral measures provide is an unobtrusive way to measure the health of a relationship. That is, the major strength of behavioral responses is that they are typically more honest and unfiltered than responses to questionnaires. As Chapter 4 will discuss, people are sometimes dishonest on questionnaires to convey a more positive (or less negative) impression.
Behavioral responses offer a particular benefit for researchers interested in unpopular attitudes, such as prejudice and discrimination. If we were to ask people the extent to which they dislike members of other ethnic groups, they might not admit to these prejudices. Alternatively, a researcher could adopt the approach used by Yale psychologist Jack Dovidio and colleagues and measure how close people sat to people of different ethnic and racial groups, using this distance as a subtle and effective behavioral measure of prejudice (see http://www.yale.edu/intergroup/ for more information). But the primary downside to using behavioral measures may be evident: We end up having to infer the reasons that people behave as they do. Suppose that in one of these experiments, European-American participants, on average, sit farther away from African-Americans than from other European-Americans. This could—and often does—indicate prejudice; however, for the sake of argument, the farthest seat from the minority group member might also be the comfortable recliner with great lighting next to the window. To understand the reasons for behaviors, researchers have to supplement the behavioral measures with either physiological or self-report measurements.
Physiological Measurement
Physiological measures are those that involve quantifying bodily processes, including heart rate, brain activity, and facial muscle movements. If we were interested in the experience of test anxiety, we could measure heart rates as people complete a difficult math test. If we wanted to study emotional reactions to political speeches, we could measure heart rate, facial muscles, and brain activity as people view video clips. The big advantage of these types of measures is that they are the least subjective and the hardest for participants to control. It is incredibly difficult for people to control their heart rate or brain activity consciously, making these a great tool for assessing emotional reactions. However, as with behavioral measures, we also need some way to contextualize physiological data.
The best example of this shortcoming is the use of the polygraph, or lie detector, to detect deception. The lie-detector test involves connecting a variety of sensors to the body to measure heart rate, blood pressure, breathing rate, and sweating. All of these are physiological markers of the body’s fight-or-flight stress response, and the test’s goal is to measure whether someone shows signs of stress while being questioned. But here is the problem: Being falsely accused is also stressful. A trained polygraph examiner must place all of the accused’s physiological responses in the proper context. Is the individual stressed throughout the exam or only stressed when asked whether he pilfered money from the cash box? Is the person stressed when asked about her relationship with her spouse because she killed him or because she was having an affair? The examiner has to be extremely careful to avoid false accusations based on misinterpretations of physiological responses. (For a recent commentary on the use of the polygraph in the courtroom, see http://www.thedailybeast.com/articles/2015/02/04/the-polygraph-has-been-lying-for-90-years.html ). The same cautions apply to using these measures in psychological research: Does heart rate increase because participants are stressed by a political message, or because the experiment is taking too long, and they are late to another appointment? The researcher should always include additional measures in the study to help sort out the reasons behind physiological change.
Self-Report Measurement
[Photo: A self-report measure might be used to determine how likely voters are to support a candidate. Credit: Digital Vision/Photodisc/Thinkstock]
Self-report measures are those that involve asking people to report on their own thoughts, feelings, and behaviors. If we were interested in the relationship between income and happiness, we could simply ask people to report their income and their level of happiness. If we wanted to know whether people were satisfied in their romantic relationships, we could simply ask them to rate their degree of satisfaction. The major advantage of these measures is that they provide access to internal processes. That is, if we want insight into why people voted for their favorite political candidate, the only option is to ask them. However, as the text has suggested already, people may not necessarily be honest and forthright in their answers, especially when dealing with politically incorrect or unpopular attitudes. Chapter 4 will return to this tension and discuss ways to increase the likelihood of honest self-reported answers.
Two broad categories of self-report measures can be used. One of the most common approaches is to ask for people’s responses using a fixed-format scale, which asks them to indicate their opinion on a preexisting scale. For example, a researcher might ask people, “How likely are you to vote for the Republican candidate for president?” on a scale from 1 (not likely) to 7 (very likely). The other broad approach is to obtain responses using a free-response format, which asks people to express their opinion in an open-ended format. For example, researchers might ask people to explain, “What are the factors you consider in choosing a political candidate?” The trade-off between these two categories is essentially a choice between data that is easy to code and analyze and data that is rich and complex. In general, fixed-format scales are used more in quantitative research, while free-response formats are used more in qualitative research. Chapter 4 will discuss these categories further in a discussion of survey research.
Research: Thinking Critically
Neuroscience and Addictive Behaviors
Follow the link below to read an article by journalist Christian Nordqvist. In this article, Nordqvist reviews recent research suggesting that food addiction might involve brain mechanisms similar to those involved in drug addiction. As you read the article, consider what you have learned so far about the research process, and then respond to the questions below.
http://www.medicalnewstoday.com/articles/221233.php
Think About It:
1. Is the study described here descriptive, correlational, or experimental? Explain.
2. Can we conclude from this study that food addiction causes brain abnormalities? Why or why not?
3. The authors of the study concluded: “The current study also provides evidence that objectively measured biological differences are related to variations in YFAS (Yale Food Addiction Scale) scores, thus providing further support for the validity of the scale.” What type(s) of validity are they referring to? Explain.
4. What types of measures are included in this study (e.g., behavioral, self-report)? What are the strengths and limitations of these measures in this study?
Converging Operations: The Best of All Worlds
As these descriptions show, each type of measurement has its strengths and flaws. So, how do researchers decide which one to use? This question has to be answered for every case, and the answer involves consideration of three factors: fit with the research question; insights from previous research; and practical considerations like budget and equipment availability. However, in an ideal world, a program of research will use a wide variety of measures and designs. The term for this approach is converging operations, or the use of multiple research methods to solve a single problem. In essence, over the course of several studies—perhaps spanning several years—a researcher would address a research question using different designs, different measures, and different levels of analysis.
One good example of converging operations comes from the research of psychologist James Gross and his colleagues at Stanford University. Gross and his team study the ways that people regulate their emotional responses and have conducted this work using everything from questionnaires to brain scans (see http://spl.stanford.edu/projects.html ).
One branch of Gross’s research has examined the consequences of trying to either suppress emotions (pretend they are not happening) or reappraise them (think of them in a different light). Gross’s team studies suppression by asking people to hold in their emotional reactions while watching a graphic medical video. The researchers study reappraisal by asking people to watch the same video while trying to view it as a medical student, thus changing the meaning of what they see. When people try to suppress their emotional responses, they experience an ironic increase in physiological and self-reported emotional responses, as well as deficits in cognitive and social functioning. When reappraising emotions, on the other hand, people experience lower levels of both reported and physiological emotion, without any loss of other functioning. In another branch of the research, Gross and colleagues have examined the neural processes at work when people change their perspective about an emotional event. In yet another branch of the research, they have used self-report measures to examine individual differences in emotional responses, with the goal of understanding why some people are more capable of managing their emotions than others. Taken together, these studies all converge into a more comprehensive picture of the process of emotion regulation than would be possible from any single study or method.