Review of the Overeating Questionnaire by JAMES P. DONNELLY, Assistant Professor, Department of Counseling, School & Educational Psychology, University at Buffalo, Amherst, NY:
DESCRIPTION. The Overeating Questionnaire (OQ) is an 80-item self-report measure of attitudes and behaviors related to obesity. In the test manual, the authors indicated that the OQ was developed to meet a growing need for a comprehensive measure useful in the treatment of obesity, especially in individualized treatment planning. They also noted that the wide age range covered by the norms for the measure meets the increasing need for assessment of children and adolescents in weight-loss programs. Users are advised that the test is not intended to be used in diagnosis of eating disorders such as anorexia or more general mental health issues like depression.
The measure includes two validity scales (Inconsistent Responding and Defensiveness) as well as 10 clinically oriented scales. The six clinical scales specifically related to eating include: Overeating, Undereating, Craving, Expectations about Eating, Rationalizations, and Motivation to Lose Weight. The remaining four clinical scales address more general health-related issues thought to be central to weight loss treatment, including Health Habits, Body Image, Social Isolation, and Affective Disturbance. The measure also includes 14 items related to patient identity, demographics, weight, and general health behavior.
The OQ can be completed via paper form or computer, and can be administered by a technician. Interpretation of results, which include raw scores, normalized T scores, percentiles, and a graphic profile plot, should be done by a professional with competence in psychometrics sufficient to be able to read and understand the test manual. Test completion is said to take about 20 minutes on average, and the items require a fourth-grade reading level. The paper or "autoscore" version is printed on a cleverly designed form that integrates all items, scoring instructions and worksheet, and a scoring page (or "profile sheet") that includes raw score, percentile, and T score equivalents. Hand scoring on the worksheet is facilitated by a combination of arrows, boxes, and shading, which makes the computation of raw scale scores relatively quick and easy. The profiling of scores facilitates efficient visual identification of relative strengths and vulnerabilities, but is not intended for classification of subtypes of test takers. The computer version of the test was not available for this review; however, the manual provides a description and a sample report.
DEVELOPMENT. The development process appears to have generally followed accepted scale development practices (e.g., DeVellis, 2003), though some irregularities in the manual report cause concern. Item development and evaluation included two sequences of literature review, data collection, and item and scale analysis. No specific theory was cited. Following an initial literature review, 140 items thought to be related to overeating and responsiveness to weight loss interventions were written. Constructs represented in this item set included attitudes toward weight, food, eating, and self-image. Items reflecting defensiveness and general psychosocial functioning were also included. The initial item set was studied in a sample of convenience in a university medical school setting (no other description of the participants or their number is given). Based on examination of correlations, 129 items were retained, supplemented by an additional 59 new items generated from feedback from the pilot sample and additional literature review. The second item set was evaluated based on responses of 140 nursing students. The manual notes that the scale structure based on the new data was generally similar to the original set with two minor exceptions, yet no specifics on how scale structure was studied are given. For final inclusion, an item had to correlate at least .30 with its intended scale, and had to show discrimination of at least .10 greater correlation with its own versus any other scale. In addition, final decisions were made with regard to item readability and content uniqueness, resulting in the final 80-item set.
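The final retention rule described above can be sketched in a few lines. This is only an illustration, assuming the item-scale correlations have already been computed; the function name `retain_item`, its parameters, and any correlations passed to it are hypothetical and are not taken from the manual.

```python
def retain_item(own_r, other_rs, min_r=0.30, min_margin=0.10):
    """Return True if an item meets both retention criteria named in the review.

    own_r    -- correlation of the item with its intended scale
    other_rs -- correlations of the item with each of the other scales
    """
    # Criterion 1: the item correlates at least .30 with its intended scale.
    if own_r < min_r:
        return False
    # Criterion 2: that correlation exceeds the correlation with every
    # other scale by at least .10.
    return all(own_r - r >= min_margin for r in other_rs)
```

For instance, a hypothetical item correlating .55 with its own scale and at most .31 with any other scale would be retained, whereas one correlating .50 with its own scale and .45 with another would not.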
As noted, there are two validity scales, Inconsistent Responding Index (INC) and Defensiveness (DEF). The INC scale includes 15 pairs of items with correlations of .5 or greater in the standardization sample. The scale is scored by counting all of the item pairs in which the response differed by at least 2 scale points. The test authors computed the average INC score for 200 randomly generated scores to provide an interpretive guide vis-à-vis the probability that an INC score reflects random responding. For example, an INC score of 5 is associated with a 71% likelihood that the scale was completed randomly. The Defensiveness scale includes seven items representing idealized self-evaluations (e.g., "I am always happy"). Relatively less information is provided on this scale, except that T scores above 60 are said to suggest caution in interpretation and reassurance for anyone completing the scale in the context of treatment.
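The INC counting rule can be illustrated with a short sketch. The review does not list the 15 item pairs, so the pairs and ratings below are made up for illustration, and the function name `inc_score` is hypothetical; only the "differ by at least 2 scale points" rule comes from the description above.

```python
def inc_score(responses, item_pairs, threshold=2):
    """Count item pairs whose two ratings differ by at least `threshold` points."""
    return sum(
        1
        for first, second in item_pairs
        if abs(responses[first] - responses[second]) >= threshold
    )


# Hypothetical data: three pairs of item numbers (the OQ uses 15 pairs)
# and invented ratings keyed by item number.
example_pairs = [(1, 2), (3, 4), (5, 6)]
example_responses = {1: 0, 2: 3, 3: 2, 4: 2, 5: 4, 6: 2}
example_inc = inc_score(example_responses, example_pairs)
```

Here the first and third pairs differ by 3 and 2 points respectively, so two of the three pairs count toward the index.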
TECHNICAL.
Standardization. The standardization sample of 1,788 was recruited nationally from schools and community settings. A table of breakdowns by gender, age, race/ethnicity, education, and region is provided with national proportions for each variable for comparison, with the exception of age (perhaps because the categories used for the test were not comparable to U.S. Census records, though no explanation is given). Overall, as the test authors noted, the sample resembles national data with some underrepresentation of males and some minority groups. The sample data were then transformed to normalized T scores, which were the basis for both the examination of subgroup differences and for the clinical scoring procedures.
The analysis of subgroups involved inspection of means with interpretation of differences guided by a general statement regarding effect sizes (.1-.3 = small, .3-.5 = moderate, greater than .5 = large). The use of effect sizes as an interpretive guide is laudable, but more specific reference to the meaningfulness of these numbers in the context of obesity research and treatment would be a significant improvement. For example, some of the subscales may represent attitudes and behaviors that are more difficult to change in treatment than others; some scales may be more stable following treatment than others; and some may be more highly correlated with other treatment outcomes such as Body Mass Index, any of which would significantly affect interpretation. We can hope that future research provides such data. Nevertheless, the tables indicate that most of the subgroup mean differences are less than the 3 T-score points the authors suggest is the upper limit of a small effect. The differences beyond this level are noted in text, and further research is acknowledged as important in these instances. The overall conclusion that the subgroup differences are minimal simplifies the matter of scoring and interpretation because the T-score norms essentially become a "one size fits all" scoring protocol, a trade of simplicity for specificity that may be welcomed in the clinical setting on purely practical grounds, but cannot be said to reflect strong evidence-based assessment at this point in time.
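The link between T-score points and the effect-size bands can be made concrete. The sketch below assumes the conventional T-score metric (standard deviation of 10), so that a 3-point difference corresponds to 0.3 standard deviations. The function names, the "negligible" label for values below .1, and the treatment of values falling exactly on a boundary are assumptions made for illustration.

```python
T_SD = 10.0  # T scores are conventionally scaled to a standard deviation of 10


def effect_size_from_t(mean_diff_t):
    """Convert a difference between subgroup T-score means to SD units."""
    return abs(mean_diff_t) / T_SD


def label_effect(d):
    """Map an effect size onto the bands quoted in the review."""
    if d < 0.1:
        return "negligible"  # assumed label; the review names no band below .1
    if d <= 0.3:
        return "small"       # 3 T-score points is the stated upper limit
    if d <= 0.5:
        return "moderate"
    return "large"
```

Under these assumptions, a subgroup difference of 3 T-score points is labeled small, 4 points moderate, and 6 points large.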
Reliability. Reliability data for the OQ are presented in terms of internal consistency for the standardization sample, and 1-week test-retest reliability for a separate group. The coefficient alpha estimates for the 10 clinical scales and the Defensiveness scale show evidence of strong internal consistency, with a range of .79 to .88 across the subscales for the full sample. Interestingly, the test authors separately examined internal consistency for the 68 children aged 9 or 10 in the sample. For this group, one scale (Health Habits) dipped below .70 (to .66), but otherwise the reliability estimates remained reasonably strong (range = .72-.88). In the same table, the authors also provided corrected median item-total correlations for the items in each scale, along with ranges for these estimates. Again, the evidence points toward desirable internal consistency. The 1-week test-retest data are also strong if we merely examine the range of the estimates (.64-.94), but are much more limited when taking into account the small number in this sample (n = 24), the fact that no information is given about the sample, and the absence of any theoretical or other comment on why this interval was chosen or whether the constructs measured by the scales should be stable over this interval.
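For readers less familiar with coefficient alpha, the standard formula can be sketched as follows. This is a generic illustration of Cronbach's alpha, not the test authors' computation; the function name `cronbach_alpha` and the layout of the input (one list of item ratings per respondent) are assumptions.

```python
from statistics import pvariance


def cronbach_alpha(item_scores):
    """Coefficient alpha for a scale.

    item_scores -- list of respondents, each a list of k item ratings
    """
    k = len(item_scores[0])
    # Variance of each item across respondents.
    item_vars = [pvariance([row[i] for row in item_scores]) for i in range(k)]
    # Variance of the summed scale score across respondents.
    total_var = pvariance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)
```

Alpha approaches 1 when the items rise and fall together across respondents, and falls toward 0 when the items are unrelated to one another.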
Validity. The manual reports evidence of construct validity that reflects internal and external validity characteristics of the scales. The internal validity report includes tables of scale intercorrelations as well as the results of a principal components analysis on the standardization sample. The external validity data include correlations with a number of other scales and variables chosen to reflect plausible relationships that would provide convergent and divergent validity evidence.
The table of intercorrelations and the accompanying interpretive text are consistent with previously described internal structure of the measure. The principal components analysis was conducted separately for seven scales measuring vulnerabilities (e.g., Overeating) and the remaining three measuring strengths (e.g., Motivation to Lose Weight). The table reporting this analysis includes only the component loadings. No other information on important details of the analysis that should typically be reported is given (e.g., rotation, extraction criteria, eigenvalues) (Henson & Roberts, 2006). The authors noted that the loadings are generally consistent with indicated scales, though, for example, two clearly distinct but adjoining components are combined in a single scale.
Additional construct validity data are presented in the form of correlational studies further examining the relationship of OQ scales to person characteristics such as BMI in the standardization sample, and a small sample (N = 50) study of OQ correlations with five previously established self-report measures of related constructs (e.g., eating, self-concept, stress). In addition, a study of Piers-Harris Self-Concept and OQ scores for 268 of the "youngsters" from the standardization sample was mentioned (no other information is given on this subsample). The authors' conclusion that the overall pattern is consistent with expectations given the nature of the OQ scale constructs is quite global but not unreasonable.
COMMENTARY. Strengths of the OQ include the efficiency of a single instrument for virtually anyone who might be seen in treatment, ease of administration and scoring, attention to response style, inclusion of specific eating and more general health behaviors, a reasonably large standardization sample of children and adults, internal consistency reliability, face validity, and some evidence of construct validity. The question of to what extent the standardization sample resembles the likely clinical population is not directly addressed. A case could be made that the sample is, in fact, a good comparison one because a large proportion of the U.S. population is overweight and at some point may seek professional assistance. The use of effect sizes in interpretation is commendable, but should eventually be more specifically associated with clinical data in the intended population in future versions of the scale. In addition, some details of the measure development process are missing from the manual (e.g., minimal reporting of the pilot samples, few details of the principal-components analysis).
SUMMARY. The OQ is a relatively new measure attempting to address a major health issue with a comprehensive and efficient set of scales intended for use in individualized treatment of overeating. The test manual sets a relatively circumscribed goal of aiding in individual treatment planning, but that process must be undertaken without the benefit of any predictive data. The OQ ambitiously attempts to provide a single measure for children through older adults with a single set of norms. In providing a user-friendly format and some good psychometric evidence, it is potentially useful in the expressed goal of aiding in treatment planning. Further research is needed to enhance the clinician's ability to confidently employ the measure, especially in understanding the relationship of scores and profile patterns to treatment process and outcome.
REVIEWER'S REFERENCES
DeVellis, R. F. (2003). Scale development. Thousand Oaks, CA: Sage.
Henson, R. K., & Roberts, J. K. (2006). Use of exploratory factor analysis in published research: Some common errors and some comment on improved practice. Educational and Psychological Measurement, 66, 393-416.
Review of the Overeating Questionnaire by SANDRA D. HAYNES, Dean, School of Professional Studies, Metropolitan State College of Denver, Denver, CO:
DESCRIPTION. The Overeating Questionnaire is an 80-item self-report questionnaire designed to measure key habits, thoughts, and attitudes related to obesity in order to establish individualized weight loss programs. Such an instrument is rare as tests of eating behavior are typically geared toward anorexia nervosa and bulimia nervosa. The paper-and-pencil version of the questionnaire can be administered individually or in a group and takes approximately 20 minutes to complete. The administration time for the PC version is similar but, as suggested, administration is accomplished using computer keyboard and mouse. After completing identifying information including age, gender, education, and race/ethnicity, examinees are asked to answer questions in Part I regarding height, historical weight and eating patterns, use of alcohol and drugs, health problems, and perceptions of weight in self and others. Part II consists of a list of 80 statements that the examinee is asked to rate with regard to agreement on a 5-point scale: Not at all (0), A little bit (1), Moderately (2), Quite a lot (3), and Extremely (4). Care should be taken to ensure that clients respond to all statements on the questionnaire. If an item has been left blank and an answer cannot be obtained from the client, the median score for that item is used in scoring. No written instructions are given to the client regarding the correction of responses made in error. The sample scoring sheet shows errors being crossed out. Verbal instruction should be given.
Scoring is manual using the paper-and-pencil AutoScore™ form or computerized using the PC version. Using the AutoScore™ form, responses are automatically transferred to an easy score worksheet. Raw scores for each question are transferred to a box under the appropriate scale heading. Numbers from columns representing each of 11 scales are then summed and transferred to the profile sheet. The profile sheet contains corresponding normalized T-scores and percentiles, and provides a graphic representation of results. Scores greater than or equal to 60T are considered high; greater than or equal to 70T are very high. Scores less than or equal to 40T are considered low. A 12th score, the Inconsistent Responding Index (INC), is calculated by finding the differences between responses to 15 pairs of similar items.
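The interpretive cutoffs just described amount to a simple banding rule, sketched below. The function name `interpret_t` is hypothetical, and the "average" label for scores between the low and high cutoffs is an assumption, since the bands quoted here do not name that range.

```python
def interpret_t(t_score):
    """Map a normalized T score onto the interpretive bands quoted in the review."""
    if t_score >= 70:
        return "very high"
    if t_score >= 60:
        return "high"
    if t_score <= 40:
        return "low"
    return "average"  # assumed label for scores between 40T and 60T
```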
Remarkably little attention is paid to the computerized scoring in the text of the manual. (It is described in an appendix.) Using this method, the client uses a computer to complete the questionnaire. Scoring is quicker and multiple tests can be scored at the same time. An interpretive report is automatically produced. Even so, care should be taken to ensure accuracy of the report.
As mentioned, 12 scores are generated from the questionnaire. Of the 12 scores, 2 are validity scores. These are Inconsistent Responding (INC) and Defensiveness (DEF). Using INC, an inconsistency is noted if the difference between the paired items is greater than or equal to 2. There is no absolute cutoff score for a high INC score. An INC of 5 or more indicates a 71% probability of random or careless responding. Clients should be queried about their distractibility during test taking. The results of the INC score should be discussed in the interpretative report. The DEF score is based on items indicative of an idealized self. If the DEF score is elevated, accuracy of responding to the questionnaire as a whole is questionable.
Of the 10 remaining scores, 6 of the scores are classified under the category Eating-Related Habits and Attitudes. This cluster of scores identifies positive and negative habits and attitudes that enhance or interfere with maintenance of healthy body weight. These scores are: Overeating (OVER), Undereating (UNDER), Craving (CRAV), Expectations About Eating (EXP), Rationalizations (RAT), and Motivation to Lose Weight (MOT). The 4 remaining scales are classified as General Health Habits and Psychosocial Functioning. These scores are: Health Habits (HEAL), Body Image (BODY), Social Isolation (SOCIS), and Affective Disturbance (AFF). This cluster of scores identifies positive and negative aspects of the environment that enhance or interfere with the maintenance of healthy body weight. Taken together, these scores are designed to help the clinician and client develop an effective, personalized weight reduction plan.
DEVELOPMENT. The OQ was formulated after extensive literature review, creation of an initial item pool of 140 items, and modification of the item pools and scales in two pilot tests. The initial items were related to attitudes toward weight, food and eating, self-image, and defensive response. Related questions were placed into different scales as they were identified in the pilot testing process. The 80-item questionnaire was derived from an intercorrelation evaluation of "fit" within the scales and from feedback from respondents. The INC score was incorporated after the final 80 questions were decided upon by correlation of item pairs. Pairs with a correlation of .50 or higher in the standardization sample were included in the index. Readability was taken into consideration and the reading level for the final form is fourth grade.
TECHNICAL.
Standardization. A standardization sample of 1,788 individuals ranging in age from 9 to 98 from public, nonclinical settings (such as public schools) was used to standardize the OQ. Males, persons of color, and those with less education were somewhat underrepresented. Nonetheless, the authors examined differences among gender, ethnicity, age, education, and region of the United States. Standard scores held relatively true for these demographic variables. The authors are well aware of the need to continue their research in the area of differences among individuals from various demographic backgrounds.
Reliability. Estimates of internal consistency (coefficient alpha), item-to-scale correlations, and test-retest reliability were examined. All measures of reliability indicate that the OQ is a reliable measure. Specific values are generally acceptable to high with an internal consistency median value of .82 (.77 for respondents aged 9-10), item to scale correlations median value of .55, and test-retest correlation median value of .88. The first two estimates of reliability were conducted using the entire standardization sample. Test-retest reliability used a subgroup of 24 individuals aged 27-64 with a 1-week interval between testing. Further investigation of test-retest reliability is warranted given the small sample size and short retest interval.
Validity. Construct and discriminant validity measures were used to assess the validity of the OQ. Construct validity was evaluated in three ways: interscale correlations, a factor analysis showing the relationships among responses given to test items, and correlations between a scale and other measures of a similar characteristic. The first two measures showed strong evidence that the OQ scales measure unique although sometimes related constructs. The third measure indicated good correlation with other measures of similar characteristics and good negative correlation with other measures of opposite characteristics.
Discriminant validity was assessed in two ways. First, three subgroups from the standardization sample who indicated in one of three ways they were overweight were compared to the overall sample. As expected, individual scores from these groups differed significantly from those without weight problems on most scales. However, females scored differently on more scales than did males. Such a finding underscores the need for further research into gender and other demographic differences in scoring. Second, the standardization sample was compared to a group of individuals who were in treatment for mood disorders. All of these individuals were overweight. All but three scores were above average for this group as compared to the standardization group.
COMMENTARY. The major strength of the OQ is its measurement of the key habits, thoughts, and attitudes related to obesity in order to establish individualized weight loss programs. Thus, not only does the questionnaire focus on an important yet often neglected area of eating disorders (obesity), it also appears that it may be a useful instrument in the development of personalized weight loss programs. The efficacy of the latter claim needs further research, however. Administration and scoring are straightforward and the ability to administer the OQ to individuals or to a group is a plus.
The manual is well organized and is easy to read. Psychometric concepts are explained prior to giving the specific measures of the OQ and were well evaluated. More supporting interpretive comments would make the test more useful in clinical situations.
SUMMARY. The OQ appears to be a well-researched measure of factors that influence obesity. More research is needed in the efficacy of the instrument in establishing effective treatment protocols.