14: Measurement and Data Quality

·  For additional ancillary materials related to this chapter, please visit thePoint.

In quantitative studies, an ideal data collection procedure is one that measures a construct accurately, soundly, and with precision. Biophysiologic methods have a higher chance of success in attaining these goals than self-report or observational methods, but no method is flawless. In this chapter, we discuss criteria for evaluating the quality of data obtained by measuring constructs with structured instruments. Measurement in the health fields is evolving; a fuller discussion of the new directions and controversies, and a more detailed presentation of statistical issues in measurement, is provided in Polit and Yang (2015). We begin by discussing principles of measurement.

MEASUREMENT

Quantitative studies obtain data through the measurement of constructs. Clinicians also require that phenomena of interest be measured. Measurement involves assigning numbers to represent the amount of an attribute present in a person or object. Attributes are not constant: They vary from day to day or from one person to another. Variability is presumed to be capable of a numeric expression signifying how much of an attribute is present. The purpose of assigning numbers is to differentiate between people with varying degrees of the attribute.

Rules and Measurement

Measurement involves assigning numbers according to rules. Rules are necessary to promote consistency and interpretability. The rules for measuring temperature, weight, and other physical attributes are familiar to us. Rules for measuring constructs such as nausea or quality of life, however, have to be invented. Whether the data are collected by observation, self-report, or some other method, researchers must specify criteria for assigning numeric values to the characteristic of interest. When researchers or clinicians invent a set of rules to gauge a construct, they create a  measure  of the construct. Measures yield  scores —numeric values that communicate how much of an attribute is present or whether it is present at all.

The rules for measuring constructs must be evaluated to see if they are good rules. It is not enough to have rules—the rules must yield quantitative information that truly and accurately corresponds to different amounts of the targeted trait. New measurement rules reflect hypotheses about how attributes vary. The adequacy of the hypotheses—that is, the worth of the measurements—needs to be assessed empirically.

Researchers (and clinicians) work with fallible measures. Instruments that measure psychosocial phenomena by means of self-reports or observation are more error-prone than physical measures, but few measurements are error-free.

Advantages of Measurement

What exactly does measurement accomplish? Consider how hampered health care professionals would be in the absence of measurement. For example, what if there were no measures of body temperature or blood pressure? A major strength of measurement is that it reduces subjectivity and guesswork. Because measurement is based on explicit rules, resulting information tends to be objective; that is, it can be independently verified. Two people measuring a person’s weight using the same scale would likely get identical results. Most measures incorporate mechanisms for minimizing subjectivity.

Measurement also makes it possible to obtain reasonably precise information. Instead of describing Alex as “rather tall,” we can depict him as being 6 feet 3 inches tall. With precise measures, researchers can differentiate among people with different degrees of an attribute.

Finally, measurement is a language of communication. Numbers are less vague than words. If a researcher reported that the average oral temperature of a sample of patients was “high,” different readers might interpret the sample’s physiologic state differently. However, if the researcher reported an average temperature of 99.8°F, there would be no ambiguity.

Theories of Measurement

Psychometrics  is the branch of psychology concerned with the theory and methods of psychological measurement. Health measurement has been strongly influenced by psychometrics, although differences in aims and conceptualizations have begun to emerge. When new measures are developed and tested, researchers often say that they are undertaking a psychometric assessment.

Within psychometrics (and health measurement), two theories of measurement have been influential. Classical test theory (CTT) is a psychometric theory of measurement that has been dominant until fairly recently. CTT has been used as a basis for developing multi-item measures of health constructs and is also appropriate for conceptualizing all types of measurements (e.g., biophysiologic measures). An alternative measurement theory that is gaining in popularity, item response theory (IRT), is discussed in Chapter 15. Unlike CTT, IRT is an appropriate framework only for multi-item scales and tests.

Errors of Measurement

Procedures for obtaining measurements, as well as the objects being measured, are susceptible to influences that can alter the resulting data. Some influences can be controlled or minimized, and attempts should be made to do so, but such efforts are rarely completely successful.

Instruments that are not perfectly accurate yield measurements containing some error. Within classical test theory, an observed (or obtained) score can be conceptualized as having two parts: an error component and a true component. This can be written as follows:

Obtained score = True score ± Error

or

X_O = X_T ± X_E

The first term in the equation is an observed score (for example, a score on an anxiety scale). X_T is the value that would be obtained with an infallible measure. The true score is hypothetical: it can never be known because measures are not infallible. The final term is the error of measurement. The difference between true and obtained scores results from factors that distort the measurement.

Decomposing obtained scores in this manner highlights an important point. When researchers measure an attribute, they are also measuring attributes that are not of interest. The true score component is what they wish to isolate; the error component is a composite of other factors that are also being measured, contrary to their wishes. We illustrate with an exaggerated example. Suppose a researcher measured the weight of 10 people on a spring scale. As participants step on the scale, the researcher places a hand on their shoulders and applies pressure. The resulting measures (the X_O values) will be biased upward because scores reflect both actual weight (X_T) and pressure (X_E). Errors of measurement are problematic because their value is unknown and also because they often are variable. In this example, the amount of pressure applied likely would vary from one person to the next. In other words, the proportion of true score component in an obtained score varies from one person to the next.
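The spring-scale example can be made concrete with a small simulation. The sketch below is illustrative only: the ten weights, the range of hand pressure, and the function name `simulate_weighing` are assumptions invented for the example, not data from any study.

```python
import random

def simulate_weighing(true_weights, max_pressure=10.0, seed=0):
    """Return (obtained, errors) for scale readings distorted by hand pressure.

    Each obtained score is the true weight plus a nonnegative error whose
    size differs from person to person. All values are hypothetical.
    """
    rng = random.Random(seed)
    errors = [rng.uniform(0.0, max_pressure) for _ in true_weights]
    obtained = [t + e for t, e in zip(true_weights, errors)]
    return obtained, errors

# Hypothetical true weights (in pounds) for 10 participants.
true_weights = [120, 135, 150, 155, 160, 170, 180, 190, 200, 210]
obtained, errors = simulate_weighing(true_weights)

for t, o, e in zip(true_weights, obtained, errors):
    # The share of the obtained score that is true score differs per person.
    print(f"true={t:6.1f}  obtained={o:6.1f}  error={e:4.1f}  true share={t / o:.3f}")
```

In the simulation, the error values are visible because we generated them. In a real study only the obtained scores are available, which is why the true score remains hypothetical.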

Many factors contribute to errors of measurement. Some errors are random, while others are systematic, reflecting bias. Common sources of measurement error include the following:

1. Transient personal factors. A person’s score can be influenced by such personal states as fatigue or mood. In some cases, such factors directly affect the measurement, as when anxiety affects pulse rate measurement. In other cases, personal factors alter scores by influencing people’s motivation to cooperate, act naturally, or do their best.

2. Situational contaminants. Scores can be affected by the conditions under which they are produced. A participant’s awareness of an observer’s presence (reactivity) is one source of bias. Environmental factors, such as temperature, lighting, and time of day, are potential sources of measurement error.

3. Response-set biases. Relatively enduring characteristics of people can interfere with accurate measurements. Response sets such as social desirability or acquiescence are potential biases in self-report measures (Chapter 13).

4. Administration variations. Alterations in the methods of collecting data from one person to the next can result in score variations unrelated to variations in the target attribute. For example, if some physiologic measures are taken before a feeding and others are taken after a feeding, then measurement errors can potentially occur.

5. Instrument clarity. If the directions on an instrument are poorly understood, then scores may be affected. For example, questions in a self-report instrument may be interpreted differently by different respondents, leading to a distorted measure of the variable.

6. Item sampling. Errors can be introduced as a result of the sampling of items used in the measure. For example, a nursing student’s score on a 100-item test of critical care nursing knowledge will be influenced by which 100 questions are included. A person might get 94 questions correct on one test but 92 right on another similar test.
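Item sampling error, the last source in the list, lends itself to a brief illustration. The sketch below assumes a hypothetical pool of 1,000 possible questions, of which a student can answer 930; the pool size, the proportion known, and the function name `score_test` are all invented for the example.

```python
import random

def score_test(known_items, test_items):
    """Count how many of the test items the student can answer correctly."""
    return sum(1 for item in test_items if item in known_items)

rng = random.Random(42)
pool = list(range(1000))             # hypothetical pool of possible questions
known = set(rng.sample(pool, 930))   # questions this student can answer

# Two 100-item tests drawn from the same pool.
form_a = rng.sample(pool, 100)
form_b = rng.sample(pool, 100)

print("Score on form A:", score_test(known, form_a))
print("Score on form B:", score_test(known, form_b))
```

The student's knowledge is identical for both forms, so any difference between the two scores comes only from which 100 questions happened to be included.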

TIP: The Toolkit section of Chapter 14 of the Resource Manual includes a list of suggestions for enhancing data quality and minimizing measurement error in quantitative studies.

Major Types of Measures

Measurements for nursing research and practice can vary in a number of ways. For example, measurements can vary in terms of information source (i.e., self-reports, observation, etc.), complexity (e.g., a simple visual analog scale or a multidimensional scale with dozens of items), and type of scores they yield (e.g., continuous scores, categorical scores). Some measures are designed to be generic—that is, broadly applicable across different clinical or nonclinical populations; other measures are specific—that is, designed for use with specific groups of people. For example, there are self-efficacy scales that are generic, but there are many disease-specific self-efficacy scales (e.g., for diabetes or asthma).

Static and Adaptive Measures

Multi-item measures also differ with regard to whether they are static or adaptive. A static measure is administered in a comparable manner for everyone being measured. For a static composite scale, people complete an entire set of items and then are scored based on responses to all items. Most health-related measures are static. As an example, a widely used generic measure of depression is called the Center for Epidemiologic Studies Depression Scale, the CES-D (Radloff, 1977). Total scores on the CES-D rely on responses to the same 20 questions for everyone. Much of this book uses static scales to illustrate key measurement concepts.

An adaptive measure, by contrast, involves using responses to early questions to guide the selection of subsequent questions. Dynamic adaptive measures are becoming popular as a way to obtain precise information about an attribute with minimum respondent burden. Adaptive testing has its origin in measurement advances from item response theory. Item banks with hundreds of items have been created for broad health topics, such as physical function, pain, or sleep disturbance. The most important example of item banking is PROMIS® (Patient-Reported Outcomes Measurement Information System), developed with support from the U.S. National Institutes of Health (Cella et al., 2007). An approach called computerized adaptive testing (CAT) uses these item banks to create measurements that are tailored to individuals. With such tailoring, the set of items used to measure a construct can be different for each patient. Despite item differences, cross-patient comparisons can be made because the testing places people along a dimension of interest.
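The general logic of adaptive testing can be sketched in a few lines. The example below is a deliberately simplified toy, not the algorithm used by PROMIS or any operational CAT system: real systems estimate a person's level with item response theory models, whereas this sketch simply nudges an estimate up or down by a shrinking step. The item bank, the difficulty values, and the names `run_adaptive_test` and `respondent` are all hypothetical.

```python
def run_adaptive_test(item_bank, answer, n_items=5, start=0.0, step=1.0):
    """Administer n_items chosen one at a time from item_bank.

    item_bank: dict mapping item id -> difficulty on an arbitrary scale.
    answer:    callable(item_id) -> True if the respondent passes the item.
    Returns (list of administered item ids, final estimate).
    """
    estimate = start
    remaining = dict(item_bank)
    administered = []
    for _ in range(min(n_items, len(remaining))):
        # Pick the unused item whose difficulty is closest to the current estimate.
        item = min(remaining, key=lambda i: abs(remaining[i] - estimate))
        del remaining[item]
        administered.append(item)
        # Move the estimate up after a pass, down after a fail, then shrink the step.
        estimate += step if answer(item) else -step
        step /= 2
    return administered, estimate

# Hypothetical nine-item bank with evenly spaced difficulties.
bank = {f"item{k}": d
        for k, d in enumerate([-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0])}

def respondent(level):
    """Toy respondent who passes every item at or below his or her own level."""
    return lambda item: bank[item] <= level

items_low, est_low = run_adaptive_test(bank, respondent(-1.2))
items_high, est_high = run_adaptive_test(bank, respondent(1.2))
print(items_low, est_low)
print(items_high, est_high)
```

Both respondents start with the same item, but their paths diverge after the first response. Even though they answer different items, each ends up with an estimate on the same scale, which is what permits cross-patient comparison.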

Reflective Scales and Formative Indexes

An important distinction is whether a multi-item measure is formative or reflective, which concerns the nature of the relationship between a construct and the measure of the construct. Constructs are not directly observable—they must be inferred by the effects they have on observables, such as responses to items on a patient-reported outcome (PRO) or behaviors witnessed and recorded on an observational scale. Most health scales are  reflective scales : The items are viewed as reflections of the construct. For example, on the CES-D, it is presumed that a person’s underlying level of depression causes him or her to respond in a certain way to the items about sleep disturbance, sadness, and so on. The items on a reflective scale share a common cause—in this case, level of depression. Items on reflective scales are expected to be interrelated because they all reflect (are caused by) the construct.

Not all multi-item instruments, however, are reflective. A multi-item measure can be conceptualized as having items that “cause” or define the attribute (rather than being the effect of the attribute). Such measures are called formative measures. Several writers advocate using the term scale for multi-item reflective measures, and the term index for multi-item formative measures (DeVellis, 2012; Streiner, 2003). In a formative index, the construct is formed by the components of the index rather than being the cause of them.

A good illustration of a formative index is the Holmes-Rahe Social Readjustment Rating Scale, which is a measure of stress. Psychiatrists Holmes and Rahe studied whether stressful life events might cause illness and devised an index that asked patients to indicate which of 43 life events they had experienced in the previous year (Holmes & Rahe, 1967). Examples of life event items include death of a spouse, pregnancy, and change in residence. The life events are assigned different weights or “life change units” (e.g., 100 for death of a spouse, 20 for a change in residence), and the units are then added together. The sum of life change units defines the construct of stressful life events. The items are not the “effect” of the construct; for example, having high stress does not “cause” the death of a spouse or a residential move.
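Scoring a formative index of this kind amounts to a weighted sum. The sketch below uses only the two life change unit values mentioned in the text; the other 41 events are omitted rather than guessed, and the function name `life_change_score` is hypothetical.

```python
# Life change units for the two events whose weights are given in the text.
# The full index has 43 events; the rest are intentionally left out here.
LIFE_CHANGE_UNITS = {
    "death of a spouse": 100,
    "change in residence": 20,
}

def life_change_score(events_experienced, weights=LIFE_CHANGE_UNITS):
    """Sum the weights of the life events a person reports for the past year."""
    return sum(weights[event] for event in events_experienced)

print(life_change_score(["change in residence"]))                       # 20
print(life_change_score(["death of a spouse", "change in residence"]))  # 120
```

Because the score is built directly from the items, dropping an event from the dictionary changes what the index measures for anyone who experienced that event.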

Because the items on an index are not caused by an underlying construct, they are not necessarily intercorrelated. In fact, items with modest correlations that capture different aspects of an attribute are often desired in a formative index. Many screening tools are formative and are composed of components that independently predict an outcome.
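The contrast in inter-item correlation between reflective and formative measures can be illustrated with simulated data. The sketch below rests on invented assumptions: two reflective items that each equal a shared latent value plus independent noise, and two formative events that each occur independently with probability 0.3. The sample size, noise levels, and the helper `pearson` are choices made for the example only.

```python
import random

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(1)
n = 2000  # simulated respondents

# Reflective: both items are caused by the same latent construct plus noise.
latent = [rng.gauss(0, 1) for _ in range(n)]
item_a = [value + rng.gauss(0, 1) for value in latent]
item_b = [value + rng.gauss(0, 1) for value in latent]

# Formative: two life events that occur independently (1 = occurred).
event_a = [1 if rng.random() < 0.3 else 0 for _ in range(n)]
event_b = [1 if rng.random() < 0.3 else 0 for _ in range(n)]

r_reflective = pearson(item_a, item_b)
r_formative = pearson(event_a, event_b)
print(f"reflective items: r = {r_reflective:.2f}")
print(f"formative items:  r = {r_formative:.2f}")
```

Under these assumptions, the reflective items correlate substantially because they share a common cause, whereas the formative items are essentially uncorrelated even though both contribute to the index score.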

The development of reflective scales and formative indexes is necessarily different. For example, because the items on a formative index define the attribute, the specific items matter very much. If the item “I had crying spells” on the CES-D scale were removed, for example, the other 19 items could carry most of the burden of measuring depression. But if the item “Death of a spouse” were removed from the Holmes-Rahe index, the score would misrepresent the stress levels of people who had lost a spouse. Another consequence of having noncorrelated items on a formative index is that some of the standard assessment methods associated with CTT are not appropriate, as we explain later in this chapter.

TIP: Formative indexes are seldom created using standard psychometric approaches. Formative indexes are sometimes developed within the field of clinimetrics, which is devoted to the development of measures of clinical phenomena. Polit and Yang (2015) have written a chapter on clinimetrics in their measurement book.

MEASUREMENT PROPERTIES: AN OVERVIEW

In making decisions about how to measure their constructs, careful researchers select instruments that are known to be psychometrically sound—that is, ones that have good  measurement properties . Psychometricians have traditionally focused on two measurement properties when assessing the quality of a measure: reliability and validity. Measurement experts in health disciplines, however, have taken a broader view of the measurement properties of an instrument.

A Measurement Taxonomy

The field of health measurement was in some turmoil for many years with regard to measurement terminology and definitions. Recently, a working group in the Netherlands used a Delphi-type approach with a panel of health measurement experts to identify key measurement properties and to develop a taxonomy and definitions of those properties. The result was the creation of COSMIN, the Consensus-based Standards for the selection of health Measurement Instruments (Mokkink et al., 2010a, 2010b; Terwee et al., 2012). (Information about COSMIN can be accessed at http://www.cosmin.nl.) Polit and Yang (2015), building on the groundbreaking COSMIN work, made small modifications to the taxonomy to more clearly incorporate a time perspective. A graphic depiction of the Polit-Yang measurement taxonomy is shown in Figure 14.1.

In this taxonomy, there are four measurement property domains. Two are cross-sectional; that is, they concern the quality of measurements at one point in time. These cross-sectional domains are reliability and validity, the properties used for decades by psychometricians. Two other domains in the taxonomy concern longitudinal measurement, that is, the quality of measurements capturing changes over time. These two domains are called the reliability of change scores and responsiveness. New measures that are likely to be used to measure a construct at a single point and to assess how the construct changed over time ideally would be evaluated for all four measurement properties. The taxonomy also incorporates another concept, interpretability, that has relevance for both point-in-time scores and change scores.

Each measurement property can be evaluated by estimating  measurement parameters  that quantify the degree to which the scores on the measure have desirable properties. These estimates are the means by which conclusions can be drawn about an instrument’s quality. Estimates of measurement properties are relevant for particular applications and particular populations, and so researchers need to carefully consider the comparability of their sample to the sample used in measurement assessments of a given instrument.

TIP: The Toolkit for Chapter 14 in the accompanying Resource Manual includes a summary table that specifies measurement parameters that are relevant under different scenarios.

The four measurement property domains and the two interpretability aspects correspond to six key measurement questions, which we illustrate with an example. Suppose we were testing the effects of a nurse-led support program for family caregivers of patients with dementia and one of our outcome variables was depression. Suppose that we found that a participant in the intervention group had a score of 20 on the CES-D at baseline (high level of depression) and a score of 15 (less depression) at a 6-month follow-up. Six questions we could ask, corresponding to the elements in the measurement taxonomy, are as follows:

1. Reliability: Is the score of 20 at baseline the right score for this patient; that is, is it a dependable score value?

2. Validity: Is the scale truly measuring the construct depression, or is it measuring something else?

3. Interpretation of a score: What does a score of 20 mean? Is it high or low?

4. Reliability of change: Is the change from 20 to 15 a real change, or does it merely reflect random fluctuations in measurement?

5. Responsiveness: Does the change from 20 to 15 correspond to a commensurate improvement in degree of depression?

6. Interpretation of a change score: What does a 5-point improvement mean? Is the improvement large enough to be considered clinically significant?