2.2 Reliability and Validity
Each of the three types of research designs (descriptive, correlational, and experimental) has the same basic goal: to take a hypothesis about some phenomenon and translate it into measurable and testable terms. That is, whether researchers use a descriptive, correlational, or experimental design to test predictions about income and happiness, they still need to translate (or operationalize) the concepts of income and happiness into measures that will be meaningful for the study. Unfortunately, research measurements will always be influenced by factors in addition to the conceptual variable of interest. Answers to any set of questions about happiness will depend both on actual levels of happiness and on the ways people interpret the questions. The meditation experiment may have different effects, depending on people’s experience with meditation. Even the percentage of Republicans voting for independent candidates will vary according to characteristics of a particular candidate.
These additional sources of influence can be grouped into two categories: random and systematic errors. Random error involves chance fluctuations in measurements, such as when a participant misunderstands the question, or shows up in a terrible mood after walking through a blizzard to get to the study. Although random errors can influence measurement, they generally cancel out over the span of an entire sample. That is, some people may overreact to a question while others underreact. Heavy snowfall might put one person in a terrible mood and make another appreciate the joy of winter. While both of these examples would add error to a dataset, they would cancel each other out in a sufficiently large sample.
Systematic errors, in contrast, are those that systematically increase or decrease along with values of the measured variable. For example, people who have more experience with meditation may consistently show more improvement in a meditation experiment than those with less experience. Or, people who have higher self-esteem may score higher on a measure of happiness than those with lower self-esteem. In this case, the happiness scale does not do a good job of homing in on the concept of “happiness” and will end up instead assessing a combination of happiness and self-esteem. These types of errors can cause more serious trouble for a researcher’s hypothesis tests because they interfere with the attempts to understand the link between two variables.
In sum, the measured values of a variable reflect a combination of the true score, random error, and systematic error, as the following conceptual equation shows:
Measured Score = True Score + (Random Error + Systematic Error)
For example:
Happiness Score = Actual Happiness + (Misreading the Question + Self-Esteem)
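The conceptual equation can be illustrated with a small simulation. This is only a sketch under simplifying assumptions that are not in the text: every person has the same true score, random error is drawn from a normal distribution centered on zero, and systematic error is a constant offset (in practice it would vary with something like self-esteem). The function name and all numbers are hypothetical.

```python
import random
import statistics

def simulate_measured_scores(true_scores, random_sd, systematic_bias, seed=0):
    """Return measured scores as true score + random error + systematic error."""
    rng = random.Random(seed)
    return [t + rng.gauss(0.0, random_sd) + systematic_bias for t in true_scores]

# Hypothetical sample: 10,000 people whose true happiness is 5.0 on some scale.
true_scores = [5.0] * 10_000

# Random error only: individual scores are off, but the errors largely cancel.
random_only = simulate_measured_scores(true_scores, random_sd=1.0, systematic_bias=0.0)

# Random plus systematic error: the constant offset does not cancel.
with_bias = simulate_measured_scores(true_scores, random_sd=1.0, systematic_bias=0.5)

mean_random_only = statistics.mean(random_only)
mean_with_bias = statistics.mean(with_bias)

print(round(mean_random_only, 2))  # close to the true score of 5.0
print(round(mean_with_bias, 2))    # shifted upward by about 0.5
```

In a sample this large the first average lands very near the true score, while the second stays shifted by the systematic component no matter how many people are added.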
So, if our measurements are also affected by outside influences, how do we know whether our measures are meaningful? Occasionally, the answer to this question is straightforward; if we ask people to report their weight or their income level, these values can be verified using objective sources. Many research questions within psychology, however, involve more ambiguity. How do we know that our happiness scale is accurate? The problem is that we have no way to objectively verify happiness beyond people’s self-reports of their own happiness. What researchers need, then, are ways to assess how close they are to measuring happiness in a meaningful way. This assessment involves two related concepts: reliability, or the consistency of a measure; and validity, or the accuracy of a measure. This section examines both of these concepts in detail.
Reliability
The consistency of time measurement by watches, cell phones, and clocks reflects a high degree of reliability. People think of a watch as reliable when it keeps track of the time consistently: an hour should take the same amount of time to pass, 24 times per day. Likewise, a scale is reliable when it gives the same value for weight in back-to-back measurements: an individual’s weight should be the same if he steps off the scale and right back on, provided he stays away from the fridge in the meantime.
Reliability is defined as the extent to which a measured variable is free from random errors, and it is best understood as the degree of consistency in research measurements. As the chapter discussed previously, researchers’ measures are never perfect, and five main sources of random error threaten reliability:
1. Transient states, or temporary fluctuations in participants’ cognitive or mental state; for example, some participants may complete a study after an exhausting midterm or after a fight with their significant others.
2. Stable individual differences among participants; for example, some participants are habitually more motivated or happier than other participants.
3. Situational factors in the administration of the study; for example, an experiment conducted in the early morning may make everyone tired or grumpy.
4. Bad measures that add ambiguity or confusion to the measurement; for example, participants may respond differently to a question about “the kinds of drugs you are taking.” Some may take this to mean illegal drugs, whereas others interpret it as prescription or over-the-counter drugs.
5. Mistakes in coding responses during data entry; for example, a handwritten “7” could be mistaken for a “4.” (Happily, these types of errors have been minimized by the increasing role of computers in data collection. If someone clicks the number “7” in an online survey, the computer will record it as a “7” almost every time.)
Researchers naturally want to minimize the influence of all of these sources of error, and the text will touch on techniques for doing so throughout. However, researchers are also resigned to the fact that all measurements contain a degree of error. The goal, then, is to develop an estimate of how reliable measures are. Researchers generally estimate reliability in three ways.
1. Test–retest reliability refers to the consistency of the measure over time, much like the examples of a reliable watch and a reliable scale. A fair number of research questions in the social and behavioral sciences involve measuring stable qualities. For example, if someone were to design a measure of intelligence or personality, both of these characteristics should be relatively stable over time. An individual’s score on an intelligence test today should be roughly the same as the score when tested again in five years. A person’s level of extroversion today should correlate highly with his or her level of extroversion in 20 years. The test–retest reliability of these measures is quantified by simply correlating measures at two time points. The higher these correlations are, the higher the reliability will be. This makes conceptual sense as well; if measured scores reflect the true score more than they reflect random error, then this will result in increased stability of the measurements.
2. Inter-item reliability refers to the internal consistency among different items on a measure. Think back to the last time you completed a survey. Did it seem to ask the same questions more than once? (Chapter 4 [4.1] will discuss this technique.) The repetition is included because a single item is more likely to contain measurement error than the average of several items will—remember that small random errors tend to cancel out each other. Consider the following items from Sheldon Cohen’s Perceived Stress Scale (Cohen, Kamarck, & Mermelstein, 1983):
· In the last month, how often have you felt that you were unable to control the important things in your life?
· In the last month, how often have you felt confident about your ability to handle your personal problems?
· In the last month, how often have you felt that things were going your way?
· In the last month, how often have you felt difficulties were piling up so high that you could not overcome them?
Each of these items taps into the concept of feeling “stressed out,” or overwhelmed by the demands of life. One standard way to evaluate a measure like this is by computing Cronbach’s alpha, a statistic based on the number of items and the average correlation between each pair of items. The more these items tap into a central, consistent construct, the higher the value of this statistic is. Conceptually, a higher alpha means that variation in responses to the different items reflects variation in the “true” variable being assessed by the scale items. Alpha levels typically range from zero to one, with higher numbers indicating more internal consistency. As a general rule, researchers want this index to be above 0.70 to have any confidence in the measure.
3. Interrater reliability refers to the consistency among judges observing participants’ behavior. The previous two forms of reliability were relevant in dealing with self-report scales; interrater reliability is more applicable when the research involves behavioral measures, which involve direct and systematic recording of observable behaviors. Imagine a researcher is studying whether alcohol consumption makes people behave more aggressively. One way to tackle this hypothesis would be to have a group of judges observe participants after drinking and rate their levels of aggression. In the same way that using multiple scale items helps to cancel out the small errors of individual items, using multiple judges cancels out the variations in each individual’s ratings. In this case, people could have slightly different ideas and thresholds for what constitutes aggression. To determine how much these differences matter, the researcher can evaluate the judges’ ratings with an index based on the average correlation among the ratings, such as Cronbach’s alpha computed with judges in place of items. The higher the alpha value, the more the judges agree in their ratings of aggressive behavior. Conceptually, a higher alpha value means that variation in the judges’ ratings reflects real variation in levels of aggression.
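The estimates described above can be sketched in a few lines. This is an illustration rather than a definitive procedure: the helper names `pearson_r` and `cronbach_alpha` and all of the scores are hypothetical, the alpha formula used is the common variance-based one, and the items are assumed to be scored in the same direction.

```python
import statistics

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def cronbach_alpha(columns):
    """Cronbach's alpha for a list of columns (one list of scores per item or judge).

    alpha = k / (k - 1) * (1 - sum of column variances / variance of total scores)
    """
    k = len(columns)
    totals = [sum(row) for row in zip(*columns)]
    column_var_sum = sum(statistics.variance(col) for col in columns)
    return k / (k - 1) * (1 - column_var_sum / statistics.variance(totals))

# Test-retest: hypothetical scores for five people at two time points.
time1 = [10, 12, 14, 16, 18]
time2 = [11, 12, 15, 15, 19]
test_retest_r = pearson_r(time1, time2)

# Inter-item: hypothetical responses of the same five people to three items.
item1 = [1, 2, 3, 4, 5]
item2 = [2, 2, 3, 5, 5]
item3 = [1, 3, 3, 4, 4]
alpha = cronbach_alpha([item1, item2, item3])

print(round(test_retest_r, 2))
print(round(alpha, 2))
```

For interrater reliability, the same `cronbach_alpha` function can be called with one column of ratings per judge instead of one column per item.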
Validity
Recall the watch and scale examples. Perhaps some people set their watch 10 minutes ahead to avoid being late. Or perhaps certain individuals adjust their scale by 5 pounds to boost either their motivation or self-esteem. In these cases, the watch and the scale may produce consistent measurements, but the measurements are not accurate. It turns out that the reliability of a measure is a necessary but not sufficient basis for evaluating it. Put bluntly, measures can be (and have to be) consistent, but they might still be worthless. The additional piece of the puzzle is the validity of measures, or the extent to which they accurately measure what they are designed to measure.
Whereas reliability is threatened more by random error, validity is threatened more by systematic error. If the measured scores on the happiness scale reflect, say, self-esteem more than they reflect happiness, this would threaten the validity of the scale. The previous section explained that a test designed to measure intelligence ought to be consistent over time. And, in fact, these tests do show very high degrees of reliability. However, several researchers have cast serious doubts on the validity of intelligence testing, arguing that even scores on an official IQ test are influenced by a person’s cultural background, socioeconomic status (SES), and experience with the process of test-taking (for discussion of these critiques, see Daniels et al., 1997; Gould, 1996). For example, children growing up in higher SES households tend to have more books in the home, spend more time interacting with one or both parents, and attend schools that have more time and resources available—all of which correlate with scores on IQ tests. Thus, because all of these factors could increase scores on an intelligence test, they amount to systematic error in the measure of intelligence and, therefore, threaten the validity of a measured score on an intelligence test.
Researchers have two primary ways to discuss and evaluate the validity, or accuracy, of measures: construct validity and criterion validity.
Researchers evaluate construct validity based on how well the measures capture the underlying conceptual ideas (i.e., the constructs) in a study. These constructs are equivalent to the “true score” discussed in the previous section. That is, how accurately does the bathroom scale measure the construct of weight? How accurately does an IQ test measure the construct of intelligence relative to other things? The validity of measures can be assessed in a couple of ways. On the subjective end of the continuum, researchers can evaluate construct validity by assessing the face validity of the measure, or the extent to which it simply seems like a good measure of the construct. The items from the Perceived Stress Scale have high face validity because the items match what we intuitively mean by “stress” (e.g., “How often have you felt difficulties were piling up so high that you could not overcome them?”). However, if we were to measure an individual’s speed at eating hot dogs and then state it was a stress measure, the participant might be skeptical because hot-dog eating speed would lack face validity as a measure of stress.
Although face validity is nice to have, it can sometimes (ironically) reduce the validity of the measures. Imagine seeing the following two measures on a survey of attitudes:
1. Do you dislike people whose skin color is different from yours?
2. Do you ever beat your children?
On the one hand, these are extremely face-valid measures of attitudes about prejudice and corporal punishment—the questions very much capture our intuitive ideas about these concepts. On the other hand, even people who do support these attitudes may be unlikely to answer honestly because they recognize that neither attitude is popular. In cases like this, a measure low in face validity might end up being the more accurate approach. Chapter 4 will discuss ways to strike this balance.
On the less subjective end, researchers can evaluate construct validity by examining measures’ empirical connections to both related and unrelated constructs. Imagine for a moment that we are developing a new measure of liberal political attitudes. If we think about a person who describes herself as liberal, she is likely to support gun control, equal rights, and a woman’s right to choose. And, she is less likely to be pro-war, anti-immigration, or anti-gay rights. Therefore, we would expect our new liberalism measure to correlate positively with existing measures of attitudes toward guns, affirmative action, and abortion. This pattern of correlations taps into the metric of convergent validity, or the extent to which our measure overlaps with conceptually similar measures. But, we would want to ensure that the new measure captures something distinct from other constructs. In this case, we might want to demonstrate that we have developed a true measure of political attitudes, which does not simply correlate with religious beliefs. That is, we would want to show that liberal political views could be independent of religion. This hypothesized lack of correlations taps into the metric of discriminant validity, or the extent to which a measure diverges from unrelated measures.
To take another example, imagine someone wanted to develop a new measure of narcissism, usually defined as an intense desire to be liked and admired by other people. Narcissists tend to be self-absorbed but also very attuned to the feedback they receive from other people—especially feedback about the extent to which people admire them. Narcissism somewhat resembles self-esteem but differs enough; perhaps it is best viewed as high and unstable self-esteem. So, given these facts, a researcher might assess the discriminant validity of the measure by making sure it does not overlap too closely with measures of self-esteem or self-confidence. This approach would establish that the narcissism measure stands apart from these different constructs. The researcher might then assess the convergent validity of the measure by making sure that it does correlate with things like sensitivity to rejection and need for approval. These correlations would place the measure into a broader theoretical context and help to establish it as a valid measure of the construct of narcissism.
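The pattern of correlations described for the narcissism example can be sketched with simulated data. Everything here is an assumption made for illustration: the scale names, the sample size, and the weights that make need for approval strongly related and self-esteem only weakly related to the underlying trait. It is not a result from any real study.

```python
import random
import statistics

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(42)  # fixed seed so the sketch is repeatable
n = 2000

# Hypothetical latent narcissism level for each simulated participant.
trait = [rng.gauss(0, 1) for _ in range(n)]

# New scale and a conceptually similar scale: both driven mostly by the trait.
new_scale = [t + rng.gauss(0, 0.5) for t in trait]
need_for_approval = [t + rng.gauss(0, 0.5) for t in trait]

# A related but distinct construct: only weakly driven by the trait.
self_esteem = [0.3 * t + rng.gauss(0, 1) for t in trait]

convergent_r = pearson_r(new_scale, need_for_approval)
discriminant_r = pearson_r(new_scale, self_esteem)

print(round(convergent_r, 2))    # high by construction
print(round(discriminant_r, 2))  # much lower by construction
```

A large first correlation alongside a small second one is the pattern a researcher would hope to see; how large and how small is a judgment call that depends on the constructs involved.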
Photo caption: Criterion validity can be used to predict a future behavioral outcome like management success. (Photo credit: Jupiterimages/Stockbyte/Thinkstock)
Criterion validity involves evaluating the validity of measures based on the association between measures and relevant behavioral outcomes. The “criterion” in this case refers to a measure that can be used to make decisions. For example, if someone developed a personality test to assess an individual’s management style, the most relevant metric of its validity is whether it predicts a person’s actual behavior as a manager. That is, we might expect people scoring high on this scale to be able to increase the productivity of their employees and to maintain a comfortable work environment. Likewise, if someone developed a measure that predicted the best careers for graduating seniors based on their skills and personalities, then criterion validity would be assessed using people’s actual success in these various careers. Whereas construct validity is more concerned with the underlying theory behind the constructs, criterion validity is more concerned with the practical application of measures. As might be expected, researchers are more likely to use this approach in applied settings.
That said, criterion validity is also a useful way to supplement validation of a new questionnaire. For example, a questionnaire about generosity should be able to predict people’s annual giving to charities, and a questionnaire about hostility ought to predict hostile behaviors. To supplement the construct validity of the narcissism measure, a researcher might examine its ability to predict the ways people respond to rejection and approval. Based on the definition of the construct, the researcher might hypothesize that narcissists would become hostile following rejection and perhaps become eager to please following approval. If these predictions were supported, it would mean further validation that the measure was capturing the construct of narcissism.
Criterion validity falls into one of two categories, depending on whether the researcher is interested in the present or the future. Predictive validity involves attempting to predict a future behavioral outcome based on the measure, as in the examples of the management-style and career-placement measures. Predictive validity is also at work when researchers (and colleges) try to predict applicants’ likelihood of school success based on SAT or GRE scores. The goal here is to validate the construct via its ability to predict the future.
In contrast, concurrent validity involves attempting to link a self-report measure with a behavioral measure collected at the same time, as in the examples of the generosity and hostility questionnaires. The phrase “at the same time” is used loosely here; these self-report and behavioral measures may be separated by a short time span. In fact, concurrent validity sometimes involves trying to predict behaviors that occurred before completion of the scale, such as trying to predict students’ past drinking behaviors from an “attitudes toward alcohol” scale. The goal in this case is to validate the construct via its association with similar measures.
Comparing Reliability and Validity
This section has discussed how both reliability (consistency) and validity (accuracy) are ways to evaluate measured variables and to assess how well these measurements capture the underlying conceptual variable. In establishing estimates of both of these metrics, researchers essentially examine a set of correlations with their measured variables. But while reliability involves correlating variables with themselves (e.g., happiness scores at week 1 and week 4), validity involves correlating variables with other variables (e.g., happiness scale with the number of times a person smiles). Figure 2.3 displays the relationships among types of reliability and validity.
Figure 2.3: Types of reliability and validity
We learned earlier that reliability is necessary but not sufficient to evaluate measured variables. That is, reliability has to come first and is an essential requirement for any variable—no one would trust a watch that was sometimes five minutes fast and other times ten minutes slow. If we cannot establish that a measure is reliable, then there is really no chance of establishing its construct validity because every measurement might be a reflection of random error. However, just because a measure is consistent does not make it accurate. Someone’s watch might consistently be ten minutes fast; a scale might always be five pounds under the person’s actual weight. For that matter, a test of intelligence might result in consistent scores but actually be capturing respondents’ cultural background. Reliability tells us the extent to which a measure is free from random error. Validity takes the second step of telling us the extent to which the measure is also free from systematic error.
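The watch example can be made concrete by separating the two kinds of error in a set of repeated readings. In this sketch the average error stands in for systematic error and the spread of the errors stands in for random error. The function name `summarize_errors` and the readings are hypothetical.

```python
import statistics

def summarize_errors(errors_in_minutes):
    """Return (average error, spread of errors) for repeated readings.

    The average reflects systematic error (a consistent bias);
    the spread reflects random error (inconsistency).
    """
    bias = statistics.mean(errors_in_minutes)
    spread = statistics.pstdev(errors_in_minutes)
    return bias, spread

# Watch A is always ten minutes fast: consistent (reliable) but inaccurate.
watch_a = [10.0, 10.0, 10.0, 10.0, 10.0, 10.0]

# Watch B is sometimes five minutes fast and other times ten minutes slow.
watch_b = [5.0, -10.0, 5.0, -10.0, 5.0, -10.0]

bias_a, spread_a = summarize_errors(watch_a)
bias_b, spread_b = summarize_errors(watch_b)

print(bias_a, spread_a)  # large bias, no spread
print(bias_b, spread_b)  # smaller bias, large spread
```

Watch A shows no spread at all, so it is reliable, yet every reading is off by the same amount, so it is not valid. Watch B fails the reliability requirement first, which is why no single reading from it can be trusted.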
Finally, it is worth pointing out that establishing validity for a new measure is hard work. Reliability can be tested in a single step by correlating scores from multiple time points, multiple items, or multiple judges within a study. But testing the construct validity of a new measure involves demonstrating both convergent and discriminant validity. In developing our narcissism scale, we would need to show that it correlated with things like fear of rejection (convergent) but was reasonably different from things like self-esteem (discriminant). The latter criterion is particularly difficult to establish because it takes time and effort, and multiple studies, to demonstrate that one scale is distinct from another. However, an easy way exists to avoid these challenges: Use existing measures whenever possible. Before creating a brand-new happiness scale, or narcissism scale, or self-esteem scale, check the research literature to see if one exists that has already gone through the ordeal of being validated.