FOR THE WEBSITE: https://access.infobase.com/article/2222407-reliability-and-validity?rak=1

Glossary

Quantitative variable

Numerical capturing of the extent or intensity of a characteristic of an observational unit.

Observational unit

The entity on which information and data are being gathered.

Theoretical construct

A hypothetical characteristic of an observational unit that is not directly observable.

Operationalization

The set of rules (e.g., used measurement instruments and the instructions for their application and evaluation) by which observational units are assigned to values of a variable representing a theoretical construct.

Measurement error

Random discrepancy between the hypothetical true value of the variable to be measured and the observed values of the variable.

Correlation

Statistical measure that describes the strength of association between two variables.

Reliability

Reliability refers to the precision of a particular measurement. In this article, we consider reliability with regard to the measurement of quantitative variables with the aim to study differences between observational units. Observational units can be, for example, countries, districts, municipalities, households, or persons. Examples of variables from these observational units may include poverty, segregation, integration, and political attitudes. Variables measured in social science are typically theoretical constructs, that is, hypothetical phenomena that are not directly assessable using physical measurement instruments. Throughout the article, we use the construct xenophobia as an example of an unobservable personal attitude.

In order to measure a construct in social science, a researcher first needs to define what exactly he or she wants to assess. Depending on the complexity of the construct to be measured, measurement might require an elaborate theoretical framework. In a second step, the researcher has to determine how this variable can be assessed through observations such as expert ratings, tests, surveys, or questionnaires, that is to say, any form of observable indicators that allow quantifying the variable of interest. This so-called operationalization thus determines which observable actions, conditions, or states are indicative of the construct. Most measurement instruments applied in social sciences include multiple indicators (e.g., questionnaire items) that are combined to measure one variable (e.g., by summing up the values from responses to multiple items). In our example, xenophobia could be assessed by a questionnaire entailing multiple statements that all revolve around the fear or dislike of strangers and foreigners. The sum score across all the given responses of a person represents his or her value on the variable xenophobia.

The combined measures are usually assumed to contain some amount of measurement error. Measurement error is defined as a random discrepancy between the hypothetical true value of the variable to be measured and the obtained values of the variable. A reliable measurement instrument measures a variable precisely, that is, with only a small amount of measurement error. Technically, reliability is defined as the proportion of variance in the observed measures that can be attributed to true differences between observational units. The true value can be conceptualized as the hypothetical mean value that would be obtained if the measurement were repeated infinitely often. According to the technical definition, reliability can range between 0 and 1, where a coefficient of 0 means that the observed variance consists only of measurement error and that the measurement instrument is unreliable. A reliability coefficient of 1 indicates error-free measurement. The reliability coefficients mentioned in the following can be interpreted accordingly. Note that in practice, negative estimates of these coefficients are also possible and point to unreliable measures.
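The variance-based definition can be made concrete with a small simulation. The following Python sketch is only an illustration under stated assumptions: true values and measurement errors are drawn from normal distributions, and the standard deviations (1.0 and 0.5), the sample size, and the function name `simulate_reliability` are hypothetical choices made for this example.

```python
import random
import statistics

def simulate_reliability(n_units=5000, true_sd=1.0, error_sd=0.5, seed=42):
    """Simulate observed = true + random error and return the share of true variance."""
    rng = random.Random(seed)
    true_scores = [rng.gauss(0.0, true_sd) for _ in range(n_units)]
    # Each observed value is the true value plus an independent random error.
    observed = [t + rng.gauss(0.0, error_sd) for t in true_scores]
    return statistics.pvariance(true_scores) / statistics.pvariance(observed)

reliability = simulate_reliability()
# Expected value under the assumed standard deviations: 1.0 / (1.0 + 0.25) = 0.8
theoretical = 1.0 ** 2 / (1.0 ** 2 + 0.5 ** 2)
print(f"simulated: {reliability:.3f}, theoretical: {theoretical:.3f}")
```

In real data the true values are unknown, which is why the methods described in the following sections estimate the true variance indirectly through the consistency of the measurement.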

Establishing Reliability

Whether the measurement error of an instrument is small can be established in different ways. Since reliability is defined as the precision of a measurement instrument, the reliability of an instrument is evaluated by determining the variance of the measured variable that is due to true differences between the observational units and the variance that is due to measurement error. The larger the proportion of true variance in the observed (i.e., total) variance, the more reliable the measurement instrument is. The true variance can be assessed in different ways, which typically involve establishing the consistency of the measurement. The three main methods presented here are (1) uniformity of rater assessments; (2) consistency across multiple indicators; and (3) consistency across multiple measurement occasions. Since various methods for estimating reliability exist, it is vital for researchers to formulate expectations regarding the consistency of the measured variable. Are all indicators expected to contribute similarly to the measured variable? Under which conditions and in what time period is the measurement expected to show the same results? This formulation entails defining under which conditions changes in measurement are regarded as error. Once it has been established what counts as error, the researcher can decide on the type of method with which to evaluate reliability. For example, when assessing xenophobia with different items, we could aim for the items to show a high internal consistency. We might also expect the construct xenophobia to remain relatively stable within a short time interval in which no external events occur that would have an effect on a person’s xenophobia. Note that this latter assumption is rather difficult to verify, since not all events that happen to individuals taking the questionnaire can be recorded.

In order to minimize measurement error, it is vital to eliminate other sources of error. Standardization of the measurement plays a key role in eliminating sources of error by keeping procedures, materials, conditions, and rules of scoring constant. For example, when assessing xenophobia in different districts of a city, the assessment should take place simultaneously in order to eliminate external influences on the outcome, such as political happenings on the days of assessment. Standardization fosters objectivity of the measurement, which is considered a necessary condition for obtaining reliable measures.

Reliability Across or Between Raters

In some instances, mapping the observed indicator onto a scale in order to distil the relevant information requires human coding. Coding procedures are typically used for observational studies, audio recordings, or text analyses. Human coders are given training or written instructions on how to rate the behavior they see, words they hear, or text they read. Coders might, for example, be presented with programs of political parties and asked to rate them in terms of xenophobic content.

To achieve reliability in the coding, a first step involves proper training for all coders and straightforward instructions. To evaluate whether the behavior or text is coded consistently, the coding material can be presented to multiple coders or the same coder can be asked to repeat the coding after a certain amount of time. Depending on the number of raters and the type of scale (e.g., categorical data, ranked data, data on an interval scale), different statistical procedures for determining inter- or intrarater reliability can be used: the intraclass correlation coefficient, Cohen’s kappa and Fleiss’ kappa, Krippendorff’s alpha, or interrater correlations.
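As an illustration of one of these procedures, the following Python sketch computes Cohen’s kappa for two coders who assign categorical codes to the same material. The function name `cohens_kappa`, the category labels, and the ten example codes are hypothetical and were invented for this example only.

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Chance-corrected agreement between two raters on categorical codes."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("Both raters must code the same non-empty set of units.")
    n = len(ratings_a)
    # Proportion of units on which the two raters agree.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance, based on each rater's category frequencies.
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        raise ValueError("Kappa is undefined when both raters use one identical category.")
    return (observed - expected) / (1.0 - expected)

# Hypothetical codes for ten passages of party programs.
coder_1 = ["xeno", "xeno", "neutral", "neutral", "xeno",
           "neutral", "neutral", "neutral", "xeno", "neutral"]
coder_2 = ["xeno", "neutral", "neutral", "neutral", "xeno",
           "neutral", "xeno", "neutral", "xeno", "neutral"]
kappa = cohens_kappa(coder_1, coder_2)
print(f"Cohen's kappa: {kappa:.2f}")
```

In this invented example the coders agree on 8 of 10 passages, while agreement of 0.52 would be expected by chance, so kappa is considerably lower than the raw agreement rate.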

Reliability Between Indicators

In cases where a variable is measured using several indicators, the internal consistency of those indicators can be determined. In our example, xenophobia is assessed by a questionnaire that entails statements revolving around the topic of fear or dislike of strangers and foreigners; people filling out the questionnaire are asked to mark, on a scale, the extent to which they agree with each statement. The idea is that each of the indicators (i.e., statements) represents an autonomous measure of the same variable and thus contains the assumption that xenophobia is a unidimensional construct. This type of reliability estimation is useful when the assumption of unidimensionality is made. If each of the indicators measures the same variable, the internal consistency should be high. Note, however, that the internal consistency depends on the homogeneity of the indicators. If the indicators cover a rather heterogeneous field in order to cover a broadly defined variable, a high internal consistency might not necessarily be desired. In our example, we could focus on a very narrow context such as discrimination against foreigners and formulate items such as “Foreigners should not be allowed to freely choose their place of residence” or “Foreigners should be banned from any political activity within the country.” A more heterogeneous questionnaire could include items also concerning fear of foreigners (e.g., “Foreigners take away jobs”) or other attitudes toward foreigners (e.g., “Foreigners should adapt their lifestyles to those of their fellow citizens”). The broader the context is, the less likely it is that the agreement will be similar among all the indicators.

The most prominent estimates of internal consistency are Cronbach’s alpha, KR-20, the split half coefficient, and McDonald’s omega. The underlying idea of all these methods is to estimate the extent of agreement between different indicators. Variability between indicators hints toward a lack of consistency between the indicators and therefore a lack of reliability of the overall measurement.
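To make one of these estimates concrete, the following Python sketch computes Cronbach’s alpha from the item variances and the variance of the sum score. The function name `cronbach_alpha` and the small response matrix (five respondents, three items) are hypothetical and serve only as an illustration.

```python
import statistics

def cronbach_alpha(responses):
    """Cronbach's alpha; `responses` holds one row per respondent, one column per item."""
    n_items = len(responses[0])
    if n_items < 2 or len(responses) < 2:
        raise ValueError("At least two items and two respondents are required.")
    # Variance of each single item across respondents.
    item_variances = [
        statistics.variance([row[j] for row in responses]) for j in range(n_items)
    ]
    # Variance of the sum score across respondents.
    total_variance = statistics.variance([sum(row) for row in responses])
    return n_items / (n_items - 1) * (1 - sum(item_variances) / total_variance)

# Hypothetical agreement ratings (1 to 5) on three xenophobia items.
answers = [
    [1, 2, 1],
    [2, 2, 3],
    [3, 4, 3],
    [4, 4, 5],
    [5, 5, 5],
]
alpha = cronbach_alpha(answers)
print(f"Cronbach's alpha: {alpha:.2f}")
```

Alpha is high here because respondents who agree strongly with one invented item also tend to agree with the others, so most of the sum score variance stems from covariation between the items.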

Reliability Across Time Points

If a variable is assumed to be constant across measurement occasions, the true variance can be estimated by repeatedly measuring the variable at different time points. The occurring variability across measurement occasions can be considered measurement error. Small variability between the measurement occasions indicates a precise measurement of the variable and hence reliability of the measurement instrument. In our example, a questionnaire that precisely assesses xenophobia should result in similar sum scores if the test were given to the same people on two measurement occasions. As mentioned earlier, assessing retest reliability on xenophobia includes assuming that the respondents’ true xenophobia has not changed during the time period between the two assessments, which is a rather unlikely and also an untestable assumption.

This so-called retest reliability can be estimated via correlation. If a large sample of observational units is measured at the first measurement occasion and the same sample is measured again at the second measurement occasion, the correlation between the values of the first and second measurement occasions indicates the extent to which the two sets of measures agree. In our example, a high correlation indicates that the participants attained very similar sum scores at the first and second measurement occasions; a low correlation means that the sum scores from the first measurement occasion hardly relate to the sum scores from the second measurement occasion.
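The retest correlation can be sketched in a few lines of Python. The function name `pearson_r` and the sum scores of six respondents on two occasions are hypothetical; a real application would require a far larger sample.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("Both occasions need the same respondents (at least two).")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    covariation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return covariation / math.sqrt(ss_x * ss_y)

# Hypothetical xenophobia sum scores of the same six persons on two occasions.
occasion_1 = [12, 18, 25, 9, 30, 21]
occasion_2 = [14, 17, 27, 10, 28, 22]
retest_reliability = pearson_r(occasion_1, occasion_2)
print(f"retest correlation: {retest_reliability:.2f}")
```

Note that the correlation reflects how consistently persons keep their relative position across occasions; it does not by itself reveal a uniform shift of all scores between the two occasions.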

Reliability Does Not Equal Reliability

It is important to note that reliability estimation from a repeated measurement does not necessarily lead to the same result as the estimation of internal consistency. For example, we might have chosen rather heterogeneous items for assessing xenophobia, which would probably result in a low internal consistency of the questionnaire; if we gave this questionnaire to the same people again after only 1 week, the reliability across time points might actually be rather high. The question of how to estimate reliability is thus not always straightforward. In many instances, it is impossible to pick a single method and to consider the estimated reliability coefficient as the unique reliability of the measurement instrument. Instead, the researcher should make clear assumptions regarding which type of reliability estimate is expected to be high, outlining under exactly which conditions the measurement instrument should show high precision.
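The divergence between the two kinds of estimates can be illustrated with a simulation. The following Python sketch rests on strong, purely hypothetical assumptions: each of four items mainly reflects its own stable facet, shares only a weak common component with the other items (loading 0.3), and is disturbed by a small random error on each occasion. None of these values comes from real data.

```python
import math
import random
import statistics

def cronbach_alpha(rows):
    """Cronbach's alpha for rows of item responses (one row per respondent)."""
    k = len(rows[0])
    item_vars = [statistics.variance([row[j] for row in rows]) for j in range(k)]
    total_var = statistics.variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

def pearson_r(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

rng = random.Random(11)
N_PERSONS, N_ITEMS = 2000, 4
SHARED_LOADING, NOISE_SD = 0.3, 0.3  # assumed values, chosen only for illustration

def administer(stable_parts):
    """One measurement occasion: stable item content plus fresh random error."""
    return [[s + rng.gauss(0.0, NOISE_SD) for s in person] for person in stable_parts]

# Stable part of every item: weak common component plus a strong item-specific facet.
stable = []
for _ in range(N_PERSONS):
    general = rng.gauss(0.0, 1.0)
    stable.append([SHARED_LOADING * general + rng.gauss(0.0, 1.0) for _ in range(N_ITEMS)])

week_0 = administer(stable)
week_1 = administer(stable)

alpha = cronbach_alpha(week_0)
retest = pearson_r([sum(r) for r in week_0], [sum(r) for r in week_1])
print(f"internal consistency (alpha): {alpha:.2f}")
print(f"retest correlation of sum scores: {retest:.2f}")
```

Under these assumptions the items hardly correlate with each other, so alpha is low, whereas the sum score is dominated by stable content and therefore correlates highly across the two occasions.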

Another crucial aspect to consider when establishing reliability is that its estimates are sample specific. Since reliability is defined as the proportion of true variance relative to the total variance, reliability can only be high if the true variance is high, that is, actual differences between the observational units on the quantitative variable need to exist. For example, if we presented the xenophobia questionnaire to a very homogeneous subgroup of people who hardly vary in their true value on the xenophobia variable (e.g., rightwing extremists), the estimated reliability coefficient would be low, owing to a lack of true differences behind the observed variance in the sum scores, even if the questionnaire actually measured xenophobia reliably. This sample specificity also means that when a measurement instrument is evaluated in a sample, conclusions regarding its reliability only pertain to that specific sample. A measurement instrument can be perfectly reliable for one specific subpopulation but unreliable in another. For example, the questionnaire for measuring xenophobia might be completely unreliable when given to fifth graders.
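The effect of a homogeneous sample can again be illustrated by simulation. In the following Python sketch, the same instrument with the same error variance is applied to a broad sample and to a subgroup with uniformly high true values. The cutoff of 1.5, the error standard deviation of 0.5, and all names are hypothetical assumptions made for this example.

```python
import random
import statistics

ERROR_SD = 0.5  # assumed size of the measurement error, identical in both samples

def estimated_reliability(true_scores, rng):
    """Share of true variance in the observed variance for one sample."""
    observed = [t + rng.gauss(0.0, ERROR_SD) for t in true_scores]
    return statistics.pvariance(true_scores) / statistics.pvariance(observed)

rng = random.Random(7)
general_sample = [rng.gauss(0.0, 1.0) for _ in range(20000)]
# Hypothetical homogeneous subgroup: only persons with very high true values.
homogeneous_sample = [t for t in general_sample if t > 1.5]

reliability_general = estimated_reliability(general_sample, rng)
reliability_homogeneous = estimated_reliability(homogeneous_sample, rng)
print(f"general sample:     {reliability_general:.2f}")
print(f"homogeneous sample: {reliability_homogeneous:.2f}")
```

The instrument is equally precise in both samples; the coefficient drops in the subgroup only because the true differences between persons are small relative to the unchanged measurement error.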

Validity

Validity is often defined as the condition that a measurement instrument actually measures the construct that it is intended to measure. More precisely, validity is the degree to which empirical evidence and theoretical arguments support the permissibility of interpretations and the usage of measurements. As for reliability, the present definition pertains to the measurement of variables that need to be operationalized by a measurement instrument, for example, human ratings or questionnaires. Validation can be considered a process that involves an ongoing gathering of evidence for the interpretation, evaluation, and uses of a measurement instrument. Putting reliability and validity into relation to each other, reliability is considered a necessary but insufficient prerequisite for a valid measurement. This means that a measurement instrument can be perfectly reliable without actually measuring what the researcher intended to measure. Reliability is thus distinct from validity: it can hold even if validity does not, whereas validity can only hold if reliability holds as well. For example, if the aim were to measure xenophobia and the questionnaire mainly included items such as “I do not like traveling abroad,” internal consistency might be high, but the fear or dislike of foreigners would not be directly captured by the item content, as there may be other reasons for having a high score on such items.

Establishing Validity

Validity can only be established by theory-based research. Interpretations and applications of instruments need to be supported through arguments and hypothesis-driven research. The so-called validity argument gives an overall evaluation of the collected evidence to support (or weaken) the validity of a specific measurement value interpretation. It encompasses the following components, which are considered as consecutive steps in the process of validation: (1) specification of the intended measurement value interpretation; (2) identification and formulation of testable propositions on which the measurement value interpretation is based; (3) gathering of evidence for and against the particular propositions; and (4) a comprehensive integration and summarizing evaluation of the gathered evidence.

At the beginning of each validation, a researcher must specify how a measured value is to be interpreted and what this interpretation means with regard to its intended use. Results from questionnaires on xenophobia can, for example, be used to record the general opinion in a given district in order to determine whether counterregulatory measures need to be taken. Validation for one specific interpretation cannot, however, be generalized to other interpretations. This means that each interpretation and use needs to be validated separately. Validation therefore requires not only a clear definition of the measurement object, that is, the construct and its operationalization, but also a clear specification of the intended use and interpretation of the measurement.

In the second step, testable propositions on the interpretation of measurement values should be identified and described. This description should also entail reasoned arguments that the propositions formulated are exhaustive, meaning that no relevant proposition has been left out. Further, propositions need to be formulated such that they can be refuted via empirical evidence. A suitable method to ensure this requirement is to generate competing propositions, that is, propositions that would conflict with the intended interpretation of the measured value. A proposition can be considered relevant when the rebuttal of the proposition affects the informative value of the measurement. In our example, the proposition that our questionnaire actually measures xenophobia would not hold if people with high values on the questionnaire (i.e., high xenophobia) vote for parties that are more liberal and welcoming to foreigners and people with low values on the questionnaire vote for parties with xenophobic political content. Another criterion for identifying relevant propositions is the representativeness of the measured indicators with respect to the unobserved construct or phenomenon. Problems regarding representativeness can occur in two ways: either (1) the indicators are too few and only measure a restricted part of the intended construct (construct underrepresentation) or (2) indicators are included that measure irrelevant information not part of the construct (construct irrelevant variance). These two possibilities are vital to consider when defining the observable indicators and when identifying the testable propositions.

After the formulation of testable propositions, the third step involves testing those propositions, which can include various sources of information such as the content of the included indicators (e.g., expert opinions), underlying response processes during measurement (e.g., vocalized thought processes while answering survey questions; recording eye movements), the internal structure of the collected data (e.g., analyses on dimensionality), and relationships between the measured variable and other variables (e.g., estimating correlations; predicting external criteria). In our example, the questionnaire on xenophobia could be validated by using it to predict election results. How people vote would serve as an external criterion, and the questionnaire should predict the result insofar as people with higher xenophobia measures vote for parties with xenophobic political content. The internal structure could be established by testing the unidimensionality assumption.

Last, the evidence from various sources for the propositions needs to be collected and comprehensively evaluated. The interpretation of the measured value is justified if none of the propositions could be refuted by the evidence. This also means that it is always possible that evidence gathered in the future can falsify a proposition and thus call for a renewed evaluation of the validity argument. In this way, the validation of a measurement instrument is an ongoing process that evolves over time.

See Also: Hypothesis Testing; Quantitative Methodologies; Questionnaire Survey; Sampling; Scale; Uncertainty.

Further Reading

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 2014. Standards for Educational and Psychological Testing. American Educational Research Association, Washington, DC.

Baxter, J.; Eyles, J., 1997. Evaluating qualitative research in social geography: establishing ‘rigour’ in interview analysis. Trans. Inst. Br. Geogr. 22, 505-525.

Kane, M.T., 2013. Validating the interpretations and uses of test scores. J. Educ. Meas. 50, 1-73.

Zinbarg, R.E.; Revelle, W.; Yovel, I.; Li, W., 2005. Cronbach’s α, Revelle’s β, and McDonald’s ωH: their relations with each other and two alternative conceptualizations of reliability. Psychometrika 70 (1), 123-133. https://doi.org/10.1007/s11336-003-0974-7.