Debating Ability Testing
8/4/2019 Print
https://content.ashford.edu/print/Gregory.8055.17.1?sections=ch06,ch06lev1sec1,ch06lev1sec2,ch06lev1sec3,ch06lev1sec4,ch06lev1sec5,ch06lev1… 1/79
CHAPTER 6 Group Tests and Controversies in Ability Testing
TOPIC 6A Group Tests of Ability and Related Concepts
6.1 Nature, Promise, and Pitfalls of Group Tests (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/ch06lev1sec1#ch06lev1sec1)
6.2 Group Tests of Ability (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/ch06lev1sec2#ch06lev1sec2)
6.3 Multiple Aptitude Test Batteries (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/ch06lev1sec3#ch06lev1sec3)
6.4 Predicting College Performance (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/ch06lev1sec4#ch06lev1sec4)
6.5 Postgraduate Selection Tests (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/ch06lev1sec5#ch06lev1sec5)
6.6 Educational Achievement Tests (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/ch06lev1sec6#ch06lev1sec6)
The practical success of early intelligence scales such as the 1905 Binet-Simon test motivated psychologists and educators to develop instruments that could be administered simultaneously to large numbers of examinees. Test developers were quick to realize that group tests allowed for the efficient evaluation of dozens or hundreds of examinees at the same time. As reviewed in an earlier chapter, one of the first uses of group tests was for screening and assignment of military personnel during World War I. The need to quickly test thousands of Army recruits inspired psychologists in the United States, led by Robert M. Yerkes, to make rapid advances in psychometrics and test development (Yerkes, 1921 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1796) ). Many new applications followed immediately—in education, industry, and other fields. In Topic 6A (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/ch06#ch06box1) , Group Tests of Ability and Related Concepts, we introduce the reader to the varied applications of group tests and also review a sampling of typical instruments. In addition, we explore a key question raised by the consequential nature of these tests—can examinees boost their scores significantly by taking targeted test preparation courses? This is but one of many unexpected issues raised by the widespread use of group tests. In Topic 6B (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/ch06lev1sec6#ch06lev2sec21) , Test Bias and Other Controversies, we continue a reflective theme by looking into test bias and other contentious issues in testing.
6.1 NATURE, PROMISE, AND PITFALLS OF GROUP TESTS
Group tests serve many purposes, but the vast majority can be assigned to one of three types: ability, aptitude, or achievement tests. In the real world, the distinction among these kinds of tests often is quite fuzzy (Gregory, 1994a (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib646) ). These instruments differ mainly in their functions and applications, less so in actual test content. In brief, ability tests typically sample a broad assortment of proficiencies in order to estimate current intellectual level. This information might be used for screening or placement purposes, for example, to determine the need for individual testing or to establish eligibility for a gifted and talented program. In contrast, aptitude tests usually measure a few homogeneous segments of ability and are designed to predict future performance. Predictive validity is foundational to aptitude tests, and often they are used for institutional selection purposes. Finally, achievement tests assess current skill attainment in relation to the goals of school and training programs. They are designed to mirror educational objectives in reading, writing, math, and other subject areas. Although often used to identify educational attainment of students, they also function to evaluate the adequacy of school educational programs.
Whatever their application, group tests differ from individual tests in five ways:
Multiple-choice versus open-ended format
Objective machine scoring versus examiner scoring
Group versus individualized administration
Applications in screening versus remedial planning
Huge versus merely large standardization samples
These differences allow for great speed and cost efficiency in group testing, but a price is paid for these advantages.
Although the early psychometric pioneers embraced group testing wholeheartedly, they recognized fully the nature of their Faustian bargain: Psychologists had traded the soul of the individual examinee in return for the benefits of mass testing. Whipple (1910 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1752) ) summed up the advantages of group testing but also pointed to the potential perils:
Most mental tests may be administered either to individuals or to groups. Both methods have advantages and disadvantages. The group method has, of course, the particular merit of economy of time; a class of 50 or 100 children may take a test in less than a fiftieth or a hundredth of the time needed to administer the same test individually. Again, in certain comparative studies, e.g., of the effects of a week’s vacation upon the mental efficiency of school children, it becomes imperative that all S’s should take the tests at the same time. On the other hand, there are almost sure to be some S’s in every group that, for one reason or another, fail to follow instructions or to execute the test to the best of their ability. The individual method allows E to detect these cases, and in general, by the exercise of personal supervision, to gain, as noted above, valuable information concerning S’s attitude toward the test.
In sum, group testing poses two interrelated risks: (1) some examinees will score far below their true ability, owing to motivational problems or difficulty following directions and (2) invalid scores will not be recognized as such, with undesirable consequences for these atypical examinees. There is really no simple way to entirely avoid these risks, which are part of the trade-off for the efficiency of group testing.
However, it is possible to minimize the potentially negative consequences if examiners scrutinize very low scores with skepticism and recommend individual testing for these cases.
We turn now to an analysis of group tests in a variety of settings, including cognitive tests for schools and clinics, placement tests for career and military evaluation, and aptitude tests for college and postgraduate selection.
6.2 GROUP TESTS OF ABILITY
Multidimensional Aptitude Battery-II (MAB-II)
The Multidimensional Aptitude Battery-II (MAB-II; Jackson, 1998 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib820) ) is a recent group intelligence test designed to be a paper-and-pencil equivalent of the WAIS-R. As the reader will recall, the WAIS-R is a highly respected instrument (now replaced by the WAIS-III), in its time the most widely used of the available adult intelligence tests. Kaufman (1983 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib861) ) noted that the WAIS-R was “the criterion of adult intelligence, and no other instrument even comes close.” However, a highly trained professional needs about 1½ hours just to administer the Wechsler adult test to a single person. Because professional time is at a premium, a complete Wechsler intelligence assessment (including administration, scoring, and report writing) can easily cost hundreds of dollars. Many examiners have long suspected that an appropriate group test, with the attendant advantages of objective scoring and computerized narrative report, could provide an equally valid and much less expensive alternative to individual testing for most persons.
The MAB-II was designed to produce subtests and factors parallel to the WAIS-R but employing a multiple-choice format capable of being computer scored. The apparent goal in designing this test was to produce an instrument that could be administered to dozens or hundreds of persons by one examiner (and perhaps a few proctors) with minimal training. In addition, the MAB-II was designed to yield IQ scores with psychometric properties similar to those found on the WAIS-R. Appropriate for examinees from ages 16 to 74, the MAB-II yields 10 subtest scores, as well as Verbal, Performance, and Full Scale IQs.
Although it consists of original test items, the MAB-II is mainly a sophisticated subtest-by-subtest clone of the WAIS-R. The 10 subtests are listed as follows:
Verbal           Performance
Information      Digit Symbol
Comprehension    Picture Completion
Arithmetic       Spatial
Similarities     Picture Arrangement
Vocabulary       Object Assembly
The reader will notice that Digit Span from the WAIS-R is not included on the MAB-II. The reason for this omission is largely practical: There would be no simple way to present a Digit-Span-like subtest in paper-and-pencil format. In any case, the omission is not serious. Digit Span has the lowest correlation with overall WAIS-R IQ, and it is widely recognized that this subtest makes a minimal contribution to the measurement of general intelligence.
The only significant deviation from the WAIS-R is the replacement of Block Design with a Spatial subtest on the MAB-II. In the Spatial subtest, examinees must mentally perform spatial rotations of figures and select one of five possible rotations presented as their answer (Figure 6.1 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/ch06lev1sec2#ch06fig1) ). Only mental rotations are involved (although “flipped-over” versions of the original stimulus are included as distractor items). The advanced items are very complex and demanding.
The items within each of the 10 MAB-II subtests are arranged in order of increasing difficulty, beginning with questions and problems that most adolescents and adults find quite simple and proceeding upward to items that are so difficult that very few persons get them correct. There is no penalty for guessing and examinees are encouraged to respond to every item within the time limit. Unlike the WAIS-R in which the verbal subtests are untimed power measures, every MAB-II subtest incorporates elements of both power and speed: Examinees are allowed only seven minutes to work on each subtest. Including instructions, the Verbal and Performance portions of the MAB-II each take about 50 minutes to administer.
The MAB-II is a relatively minor revision of the MAB, and the technical features of the two versions are nearly identical. A great deal of psychometric information is available for the original version, which we report here. With regard to reliability, the results are generally quite impressive. For example, in one study of over 500 adolescents ranging in age from 16 to 20, the internal consistency reliability of Verbal, Performance, and Full Scale IQs was in the high .90s. Test–retest data for this instrument also excel. In a study of 52 young psychiatric patients, the individual subtests showed reliabilities that ranged from .83 to .97 (median of .90) for the Verbal scale and from .87 to .94 (median of .91) for the Performance scale (Jackson, 1984 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib817) ). These results compare quite favorably with the psychometric standards reported for the WAIS-R.
Factor analyses of the MAB-II are broadly supportive of the construct validity of this instrument and its predecessor (Lee, Wallbrown, & Blaha, 1990 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib963) ). Most recently, Gignac (2006 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib588) ) examined the factor structure of the MAB-II using a series of confirmatory factor analyses with data on 3,121 individuals reported in Jackson (1998 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib820) ). The best fit to the data was provided by a nested model consisting of a first-order general factor, a first-order Verbal Intelligence factor, and a first-order Performance Intelligence factor. The one caveat of this study was that Arithmetic did not load specifically on the Verbal Intelligence factor independent of its contribution to the general factor.
FIGURE 6.1 Demonstration Items from Three Performance Tests of the Multidimensional Aptitude Battery-II (MAB)
Source: Reprinted with permission from Jackson, D. N. (1984a (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib817) ). Manual for the Multidimensional Aptitude Battery. Port Huron, MI: Sigma Assessment Systems, Inc. (800) 265–1285.
Other researchers have noted the strong congruence between factor analyses of the WAIS-R (with Digit Span removed) and the MAB. Typically, separate Verbal and Performance factors emerge for both tests (Wallbrown, Carmin, & Barnett, 1988 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1707) ). In a large sample of inmates, Ahrens, Evans, and Barnett (1990 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib12) ) observed validity-confirming changes in MAB scores in relation to education level. In general, with the possible exception that Arithmetic does not contribute reliably to the Verbal factor, there is good justification for the use of separate Verbal and Performance scales on this test.
In general, the validity of this test rests upon its very strong physical and empirical resemblance to its parent test, the WAIS-R. Correlational data between MAB and WAIS-R scores are crucial in this regard. For 145 persons administered the MAB and WAIS-R in counterbalanced fashion, correlations between subtests ranged from .44 (Spatial/Block Design) to .89 (Arithmetic and Vocabulary), with a median of .78. WAIS-R and MAB IQ correlations were very healthy, namely, .92 for Verbal IQ, .79 for Performance IQ, and
.91 for Full Scale IQ (Jackson, 1984a (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib817) ). With only a few exceptions, correlations between MAB and WAIS-R scores exceed those between the WAIS and the WAIS-R. Carless (2000 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib269) ) reported a similar, strong overlap between MAB scores and WAIS-R scores in a study of 85 adults for the Verbal, Performance, and Full Scale IQ scores. However, she found that 4 of the 10 MAB subtests did not correlate with the WAIS-R subscales they were designed to represent, suggesting caution in using this instrument to obtain detailed information about specific abilities.
Chappelle et al. (2010) obtained MAB-II scores for military personnel in an elite training program for AC-130 gunship operators. Both the officers who passed training (N = 59) and those who failed training (N = 20) scored above average (mean Full Scale IQs of 112.5 and 113.6, respectively), and there were no significant differences between the two groups on any of the test indices. This is a curious result insofar as IQ typically demonstrates at least mild predictive potential for real-world vocational outcomes. Further research on the MAB-II as a predictor of real-world results would be desirable.
The MAB-II shows great promise in research, career counseling, and personnel selection. In addition, this test could function as a screening instrument in clinical settings, as long as the examiner views low scores as a basis for follow-up testing with an individual intelligence test. Examiners must keep in mind that the MAB-II is a group test and, therefore, carries with it the potential for misuse in individual cases. The MAB-II should not be used in isolation for diagnostic decisions or for placement into programs such as classes for intellectually gifted persons.
A Multilevel Battery: The Cognitive Abilities Test (CogAT)
One important function of psychological testing is to assess students’ abilities that are prerequisite to traditional classroom-based learning. In designing tests for this purpose, the psychometrician must contend with the obvious and nettlesome problem that school-aged children differ hugely in their intellectual abilities. For example, a test appropriate for a sixth grader will be much too easy for a tenth grader, yet impossibly difficult for a third grader.
The answer to this dilemma is a multilevel battery, a series of overlapping tests. In a multilevel battery, each group test is designed for a specific age or grade level, but adjacent tests possess some common content. Because of the overlapping content with adjacent age or grade levels, each test possesses a suitably low floor and high ceiling for proper assessment of students at both extremes of ability. Virtually every school system in the United States uses at least one nationally normed multilevel battery.
The Cognitive Abilities Test (CogAT) is one of the best school-based test batteries in current use (Lohman & Hagen, 2001 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1003) ). A recent revision of the test is the CogAT Multilevel Edition, Form 6, released in 2001. Norms for 2005 also are available. We discuss this instrument in some detail.
The CogAT evolved from the Lorge-Thorndike Intelligence Tests, one of the first group tests of intelligence intended for widespread use within school systems. The CogAT is primarily a measure of scholastic ability but also incorporates a nonverbal reasoning battery with items that bear no direct relation to formal school instruction. The two primary batteries, suitable for students in kindergarten through third grade, are briefly discussed at the end of this section. Here we review the multilevel edition intended for students in 3rd through 12th grade.
The nine subtests of the multilevel CogAT are grouped into three batteries (Verbal, Quantitative, and Nonverbal), each including three subtests. Representative items for the subtests of the CogAT are depicted
in Figure 6.2 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/ch06lev1sec2#ch06fig2) . The tests on the Verbal Battery evaluate verbal skills and reasoning strategies (inductive and deductive) needed for effective reading and writing. The tests on the Quantitative Battery appraise quantitative skills important for mathematics and other disciplines. The Nonverbal Battery can be used to estimate cognitive level of students with limited reading skill, poor English proficiency, or inadequate educational exposure.
For each CogAT subtest, items are ordered by difficulty level in a single test booklet. However, entry and exit points differ for each of eight overlapping levels (A through H). In this manner, grade-appropriate items are provided for all examinees.
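The entry-and-exit-point arrangement can be pictured as a sliding window over one difficulty-ordered item list. The following is a minimal sketch only: the bank size, the number of items per level, and the step between levels are hypothetical values chosen for illustration, not figures from the CogAT manual, and the function names are our own.

```python
# Hypothetical illustration of overlapping levels in a multilevel battery.
# The window size (25 items) and step (5 items) are assumptions, not CogAT values.
LEVELS = "ABCDEFGH"  # eight overlapping levels, A through H


def level_window(level, items_per_level=25, step=5):
    """Return (entry, exit) indexes into the difficulty-ordered item list."""
    start = LEVELS.index(level) * step
    return start, start + items_per_level


def items_for_level(item_bank, level, items_per_level=25, step=5):
    """Slice out the items an examinee at the given level would attempt."""
    start, stop = level_window(level, items_per_level, step)
    return item_bank[start:stop]


def shared_items(item_bank, level_a, level_b):
    """Items common to two levels (the overlapping content)."""
    a = set(items_for_level(item_bank, level_a))
    b = set(items_for_level(item_bank, level_b))
    return sorted(a & b)


if __name__ == "__main__":
    bank = list(range(60))  # item 0 is easiest, item 59 is hardest
    print(level_window("A"), level_window("H"))
    print(len(shared_items(bank, "A", "B")), "items shared by levels A and B")
```

Under these assumed values, adjacent levels share most of their items, while the lowest and highest levels share none, which is the sense in which each level keeps a suitably low floor and high ceiling.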
The subtests are strictly timed, with limits that vary from 8 to 12 minutes. Each of the three batteries can be administered in less than an hour. However, the manual recommends three successive testing days for younger children. For older children, two batteries should be administered the first day, with a single testing period the next.
FIGURE 6.2 Subtests and Representative Items of the Cognitive Abilities Test, Form 6
Note: These items resemble those on the CogAT 6. Correct answers: 1: B. yogurt (the only dairy product). 2: D. swim (fish swim in the ocean). 3: E. bottom (the opposite of top). 4: A. I is greater than II (4 is greater than 2). 5: C. 26 (the algorithm is add 10, subtract 5, add 10 . . .). 6: A. −1 (the only answer that fits). 7: A (four-sided shape that is filled in). 8: D (same shape, bigger to smaller). 9: E (correct answer).
Raw scores for each battery can be transformed into an age-based normalized standard score with mean of 100 and standard deviation of 15. In addition, percentile ranks and stanines for age groups and grade level are also available. Interpolation was used to determine fall, winter, and spring grade-level norms.
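To make the idea of a normalized standard score concrete, the sketch below converts a raw score to a percentile rank within a norm group, maps that percentile onto the normal curve, and rescales to a mean of 100 and standard deviation of 15. This is one common way to normalize scores, not necessarily the exact procedure used by the test publisher. The norm sample is invented, and the stanine rule shown is the conventional one (mean 5, standard deviation 2, limited to 1 through 9).

```python
import math
from statistics import NormalDist


def midpoint_percentile(raw, norm_scores):
    """Proportion of the norm group below the raw score, counting ties as half."""
    n = len(norm_scores)
    below = sum(1 for s in norm_scores if s < raw)
    equal = sum(1 for s in norm_scores if s == raw)
    p = (below + 0.5 * equal) / n
    # Keep p strictly inside (0, 1) so the inverse normal is defined.
    floor = 0.5 / n
    return min(max(p, floor), 1.0 - floor)


def normalized_standard_score(raw, norm_scores, mean=100.0, sd=15.0):
    """Map the percentile onto a normal curve, then rescale to mean 100, SD 15."""
    z = NormalDist().inv_cdf(midpoint_percentile(raw, norm_scores))
    return mean + sd * z


def stanine_from_z(z):
    """Conventional stanine: half-SD bands centered on 5, limited to 1..9."""
    return min(9, max(1, math.floor(2.0 * z + 5.5)))


if __name__ == "__main__":
    norm_group = list(range(1, 100))  # hypothetical raw scores for one age group
    for raw in (20, 50, 80):
        print(raw, round(normalized_standard_score(raw, norm_group), 1))
```

Because the conversion runs through percentile ranks, the resulting scores follow a normal distribution even when the raw scores do not.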
The CogAT was co-normed (standardized concurrently) with two achievement tests, the Iowa Tests of Basic Skills and the Iowa Tests of Educational Development. Concurrent standardization with achievement measures is a common and desirable practice in the norming of multilevel intelligence tests. The particular virtue of joint norming is that the expected correspondence between intelligence and achievement scores is determined with great precision. As a consequence, examiners can more accurately
identify underachieving students in need of remediation or further assessment for potential learning disability.
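One way to see how joint norming supports the identification of underachievement is a simple regression sketch. Here we assume that ability and achievement are both expressed as standard scores (mean 100, standard deviation 15) and that their relationship is linear with correlation r. The value of r and the cutoff of 1.5 standard errors are hypothetical choices for illustration, not recommendations from the CogAT manual.

```python
import math


def expected_achievement(ability, r, mean=100.0):
    """Regression prediction: achievement regresses toward the mean by a factor of r."""
    return mean + r * (ability - mean)


def standard_error_of_estimate(r, sd=15.0):
    """Typical spread of observed achievement around the predicted value."""
    return sd * math.sqrt(1.0 - r * r)


def is_underachieving(ability, achievement, r, cutoff_see=1.5):
    """Flag a student whose achievement falls well below the regression prediction.

    cutoff_see is a hypothetical threshold expressed in standard errors of estimate.
    """
    gap = expected_achievement(ability, r) - achievement
    return gap > cutoff_see * standard_error_of_estimate(r)


if __name__ == "__main__":
    r = 0.75  # assumed ability-achievement correlation
    print(expected_achievement(120, r))
    print(is_underachieving(ability=120, achievement=90, r=r))
    print(is_underachieving(ability=120, achievement=110, r=r))
```

The more precisely the ability-achievement relationship is known, the narrower the band of uncertainty around the expected score, and the more confidently a large shortfall can be treated as meaningful rather than as chance variation.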
The reliability of the CogAT is exceptionally good. In previous editions, the Kuder-Richardson-20 reliability estimates for the multilevel batteries averaged .94 (Verbal), .92 (Quantitative), and .93 (Nonverbal) across all grade levels. The six-month test–retest reliabilities for alternate forms ranged from .85 to .93 (Verbal), .78 to .88 (Quantitative), and .81 to .89 (Nonverbal).
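For readers who want to see what a Kuder-Richardson-20 estimate involves, the sketch below computes it from a small matrix of right/wrong (1/0) item responses using the standard formula: k/(k − 1) multiplied by one minus the ratio of the summed item variances (p times q) to the variance of total scores. The response data are invented for illustration, and the population variance is used for total scores, which is one common convention.

```python
from statistics import pvariance


def kr20(responses):
    """KR-20 internal consistency for dichotomous (0/1) items.

    responses: list of examinee rows, each a list of 0/1 item scores.
    """
    n_people = len(responses)
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]
    total_variance = pvariance(totals)  # population variance of total scores
    pq_sum = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in responses) / n_people  # proportion passing item j
        pq_sum += p * (1.0 - p)
    return (n_items / (n_items - 1)) * (1.0 - pq_sum / total_variance)


if __name__ == "__main__":
    # Hypothetical responses of six examinees to five items.
    data = [
        [1, 1, 1, 1, 0],
        [1, 1, 1, 0, 0],
        [1, 1, 0, 0, 0],
        [1, 0, 0, 0, 0],
        [1, 1, 1, 1, 1],
        [0, 0, 0, 0, 0],
    ]
    print(round(kr20(data), 3))
```

The coefficient approaches 1.0 when examinees who pass one item tend to pass the others as well, and it falls toward zero when item responses are unrelated.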
The manual provides a wealth of information on content, criterion-related, and construct validity of the CogAT; we summarize only the most pertinent points here. Correlations between the CogAT and achievement batteries are substantial. For example, the CogAT verbal battery correlates in the .70s to .80s with achievement subtests from the Iowa Tests of Basic Skills.
The CogAT batteries predict school grades reasonably well. Correlations range from the .30s to the .60s, depending on grade level, sex, and ethnic group. There does not appear to be a clear trend as to which battery is best at predicting grade point average. Correlations between the CogAT and individual intelligence tests such as the Stanford-Binet are also substantial, typically ranging from .65 to .75. These findings speak well for the construct validity of the CogAT insofar as the Stanford-Binet is widely recognized as an excellent measure of individual intelligence.
Ansorge (1985 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib55) ) has questioned whether all three batteries are really necessary. He points out that correlations among the Verbal, Quantitative, and Nonverbal batteries are substantial. The median values across all grades are as follows:
Verbal and Quantitative .78
Nonverbal and Quantitative .78
Verbal and Nonverbal .72
Since the Quantitative battery offers little uniqueness, from a purely psychometric point of view there is no justification for including it. Nonetheless, the test authors recommend use of all batteries in hopes that differences in performance will assist teachers in remedial planning. However, the test authors do not make a strong case for doing this.
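The notion of limited uniqueness can be illustrated with the median correlations reported above. The sketch below is our own illustration, not an analysis reported by Ansorge. It uses the standard two-predictor formula for the squared multiple correlation to estimate how much of the variance in Quantitative scores is predictable from the Verbal and Nonverbal batteries together. It ignores measurement error, so the remainder overstates the truly unique portion.

```python
def shared_variance(r):
    """Proportion of variance two measures share, given their correlation."""
    return r * r


def multiple_r_squared(r_y1, r_y2, r_12):
    """Squared multiple correlation of y on two predictors from pairwise correlations."""
    numerator = r_y1 ** 2 + r_y2 ** 2 - 2.0 * r_y1 * r_y2 * r_12
    return numerator / (1.0 - r_12 ** 2)


if __name__ == "__main__":
    # Median correlations across grades, as reported in the text.
    r_verbal_quant = 0.78
    r_nonverbal_quant = 0.78
    r_verbal_nonverbal = 0.72

    print("Verbal-Quantitative shared variance:",
          round(shared_variance(r_verbal_quant), 3))
    predictable = multiple_r_squared(
        r_verbal_quant, r_nonverbal_quant, r_verbal_nonverbal
    )
    print("Quantitative variance predictable from the other two batteries:",
          round(predictable, 3))
    print("Remainder (unique variance plus error):", round(1.0 - predictable, 3))
```

On these figures, roughly seven-tenths of the variance in Quantitative scores can be predicted from the other two batteries, which is the psychometric sense in which the battery adds little that is new.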
A study by Stone (1994 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1582) ) provides a notable justification for using the CogAT as a basis for student evaluation. He found that CogAT scores for 403 third graders provided an unbiased prediction of student achievement that was more accurate than teacher ratings. In particular, teacher ratings showed bias against Caucasian and Asian American students by underpredicting their achievement scores.
Raven’s Progressive Matrices (RPM)
First introduced in 1938, Raven’s Progressive Matrices (RPM) is a nonverbal test of inductive reasoning based on figural stimuli (Raven, Court, & Raven, 1986 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1337) , 1992 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1340) ). This test has been very popular in basic research and is also used in some institutional settings for purposes of intellectual screening.
RPM was originally designed as a measure of Spearman’s g factor (Raven, 1938 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1335) ). For this reason, Raven chose a special format for the test that presumably required the exercise of g. The reader is reminded that Spearman defined g as the “eduction of correlates.” The term eduction refers to the process of figuring out relationships based on the perceived fundamental similarities between stimuli. In particular, to correctly answer items on the RPM, examinees must identify a recurring pattern or relationship between figural stimuli organized in a 3 × 3 matrix. The items are arranged in order of increasing difficulty, hence the reference to progressive matrices.
Raven’s test is actually a series of three different instruments. Much of the confusion about validity, factorial structure, and the like stems from the unexamined assumption that all three forms should produce equivalent findings. The reader is encouraged to abandon this unwarranted hypothesis. Even though the three forms of the RPM resemble one another, there may be subtle differences in the problem-solving strategies required by each.
The Coloured Progressive Matrices is a 36-item test designed for children from 5 to 11 years of age. Raven incorporated colors into this version of the test to help hold the attention of the young children. The Standard Progressive Matrices is normed for examinees from 6 years and up, although most of the items are so difficult that the test is best suited for adults. This test consists of 60 items grouped into 5 sets of 12 progressions. The Advanced Progressive Matrices is similar to the Standard version but has a higher ceiling. The Advanced version consists of 12 problems in Set I and 36 problems in Set II. This form is especially suitable for persons of superior intellect.
Large sample U.S. norms for the Coloured and Standard Progressive Matrices are reported in Raven and Summers (1986 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1337) ). Separate norms for Mexican American and African American children are included. Although there was no attempt to use a stratified random-sampling procedure, the selection of school districts was so widely varied that the American norms for children appear to be reasonably sound. Sattler (1988 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1437) ) summarizes the relevant norms for all versions of the RPM. Raven, Court, and Raven (1992 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1340) ) produced new norms for the Standard Progressive Matrices, but Gudjonsson (1995 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib663) ) has raised a concern that these data are compromised because the testing was not monitored.
For the Coloured Progressive Matrices, split-half reliabilities in the range of .65 to .94 are reported, with younger children producing lower values (Raven, Court, & Raven, 1986 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1337) ). For the Standard Progressive Matrices, a typical split-half reliability is .86, although lower values are found with younger subjects (Raven, Court, & Raven, 1983 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1338) ). Test–retest reliabilities for all three forms vary considerably from one sample to the next (Raven, 1965 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1336) ; Raven et al., 1986 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1337) ). For normal adults in their late teens or older, reliability coefficients of .80 to .93 are typical. However, for preteen children, reliability coefficients as low as .71 are reported. Thus, for younger subjects, RPM may not possess sufficient reliability to warrant its use for individual decision making.
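A split-half reliability of the kind cited above can be sketched as follows: the test is divided into two halves (here, odd-numbered and even-numbered items), the half scores are correlated, and the result is stepped up with the Spearman-Brown formula to estimate the reliability of the full-length test. The source does not say how the RPM halves were formed or whether the reported values were corrected, so the odd-even split and the correction are assumptions made for illustration, and the response data are invented.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman_brown(r_half):
    """Estimate full-length reliability from the correlation between two halves."""
    return 2.0 * r_half / (1.0 + r_half)


def split_half_reliability(responses):
    """Odd-even split-half reliability with the Spearman-Brown correction."""
    odd_scores = [sum(row[0::2]) for row in responses]   # items 1, 3, 5, ...
    even_scores = [sum(row[1::2]) for row in responses]  # items 2, 4, 6, ...
    return spearman_brown(pearson(odd_scores, even_scores))


if __name__ == "__main__":
    # Hypothetical 0/1 responses of five examinees to six items.
    data = [
        [1, 1, 1, 1, 1, 0],
        [1, 1, 1, 0, 0, 0],
        [1, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 0, 1],
        [0, 1, 0, 0, 0, 0],
    ]
    print(round(split_half_reliability(data), 3))
```

The correction matters because each half is only half as long as the full test, and shorter tests are less reliable; the uncorrected half-test correlation would understate the reliability of the complete instrument.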
Factor-analytic studies of the RPM provide little, if any, support for the original intention of the test to measure a unitary construct (Spearman’s g factor). Studies of the Coloured Progressive Matrices reveal
three orthogonal factors (e.g., Carlson & Jensen, 1980 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib832) ). Factor I consists largely of very difficult items and might be termed closure and abstract reasoning by analogy. Factor II is labeled pattern completion through identity and closure. Factor III consists of the easiest items and is defined as simple pattern completion (Carlson & Jensen, 1980 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib832) ). In sum, the very easy and the very hard items on the Coloured Progressive Matrices appear to tap different intellectual processes.
The Advanced Progressive Matrices breaks down into two factors that may have separate predictive validities (Dillon, Pohlmann, & Lohman, 1981 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib420) ). The first factor is composed of items in which the solution is obtained by adding or subtracting patterns (Figure 6.3a (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/ch06lev1sec2#ch06fig3) ). Individuals performing well on these items may excel in rapid decision making and in situations where part–whole relationships must be perceived. The second factor is composed of items in which the solution is based on the ability to perceive the progression of a pattern (Figure 6.3b (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/ch06lev1sec2#ch06fig3) ). Persons who perform well on these items may possess good mechanical ability as well as good skills for estimating projected movement and performing mental rotations. However, the skills represented by each factor are conjectural at this point and in need of independent confirmation.
A huge body of published research bears on the validity of the RPM. The early data are well summarized by Burke (1958 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib234) ), while later findings are compiled in the current RPM manuals (Raven & Summers, 1986 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1337) ; Raven, Court, & Raven, 1983 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1338) , 1986 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1339) , 1992 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1340) ). In general, validity coefficients with achievement tests range from the .30s to the .60s. As might be expected, these values are somewhat lower than found with more traditional (verbally loaded) intelligence tests. Validity coefficients with other intelligence tests range from the .50s to the .80s.
FIGURE 6.3 Raven’s Progressive Matrices: Typical Items
Also, as might be expected, the correlations tend to be higher with performance than with verbal tests. In a massive study involving thousands of schoolchildren, Saccuzzo and Johnson (1995 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1426) ) concluded that the Standard Progressive Matrices and the WISC-R showed approximately equal predictive validity and no evidence of differential validity across eight different ethnic groups. In a lengthy review, Raven (2000 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1334) ) discusses stability and variation in the norms for the Raven’s Progressive Matrices across cultural, ethnic, and socioeconomic groups over the last 60 years. Indicative of the continuing interest in this venerable instrument, Costenbader and Ngari (2001 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib361) ) describe the standardization of the Coloured Progressive Matrices in Kenya. Further indicating the huge international popularity of the test, Khaleefa and Lynn (2008 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib884) ) provide standardization data for 6- to 11-year-old children in Yemen.
Even though the RPM has not lived up to its original intentions of measuring Spearman’s g factor, the test is nonetheless a useful index of nonverbal, figural reasoning. The recent updating of norms was a much-welcomed development for this well-known test, in that many American users were leery of the outdated and limited British norms. Nonetheless, adult norms for the Standard and Advanced Progressive Matrices are still quite limited.
The RPM is particularly valuable for the supplemental testing of children and adults with hearing, language, or physical disabilities. Often these examinees are difficult to assess with traditional measures that require auditory attention, verbal expression, or physical manipulation. In contrast, the RPM can be explained through pantomime, if necessary. Moreover, the only output required of the examinee is a pencil mark or gesture denoting the chosen alternative. For these reasons, the RPM is ideally suited for testing persons with limited command of the English language. In fact, the RPM is about as culturally reduced as possible: The test protocol does not contain a single word in any language. Mills and Tissot (1995 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1160) ) found that the Advanced Progressive Matrices identified a higher proportion of minority children as gifted than did a more traditional measure of academic aptitude (the School and College Ability Test).
Bilker, Hansen, Brensinger, and others (2012 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib159) ) developed a psychometrically sound 9-item version of the 60-item Standard Progressive Matrices (SPM) test. The short test cuts testing time to a fraction of the full test. Correlations of scores on the 9-item version with the full scale were in the range of .90 to .98, indicating a minimal loss of measurement accuracy. The short SPM promises to be highly useful for research applications.
Perspective on Culture-Fair Tests
Cattell’s Culture Fair Intelligence Test (CFIT) and Raven’s Progressive Matrices (RPM) often are cited as examples of culture-fair tests, a concept with a long and confused history. We will attempt to clarify terms and issues here.
The first point to make is that intelligence tests are merely samples of what people know and can do. We must not reify intelligence and overvalue intelligence tests. Tests are never samples of innate intelligence or culture-free knowledge. All knowledge is based in culture and acquired over time. As Scarr (1994 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1444) ) notes, there is no such thing as a culture-free test.
But what about a culture-fair test, one that poses problems that are equally familiar (or unfamiliar) to all cultures? This would appear to be a more realistic possibility than a culture-free test, but even here the skeptic can raise objections. Consider the question of what a test means, which differs from culture to culture. In theory, a test of matrices would appear to be equally fair to most cultures. But in practice, issues of equity arise. Persons reared in Western cultures are trained in linear, convergent thinking. We know that the purpose of a test is to find the single, best answer and to do so quickly. We examine the 3 × 3 matrix from left to right and top to bottom, looking for the logical principles invoked in the succession of forms. Can we assume that persons reared in Nepal or New Guinea or even the remote, rural stretches of Idaho will do the same? The test may mean something different to them. Perhaps they will approach it as a measure of aesthetic progression rather than logical succession. Perhaps they will regard it as so much silliness not worthy of intense intellectual effort. To assume that a test is equally fair to all cultural groups merely because the stimuli are equally familiar (or unfamiliar) is inappropriate. We can talk about degrees of cultural fairness (or unfairness), but the notion that any test is absolutely culture-fair surely is mistaken.
6.3 MULTIPLE APTITUDE TEST BATTERIES
In a multiple aptitude test battery, the examinee is tested in several separate, homogeneous aptitude areas. Typically, the development of the subtests is dictated by the findings of factor analysis. For example, Thurstone developed one of the first multiple aptitude test batteries, the Primary Mental Abilities Test, a set of seven tests chosen on the basis of factor analysis (Thurstone, 1938 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1647) ).
More recently, several multiple aptitude test batteries have gained favor for educational and career counseling, vocational placement, and armed services classification (Gregory, 1994a (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib646) ). Each year hundreds of thousands of persons are administered one of these prominent batteries: the Differential Aptitude Test (DAT), the General Aptitude Test Battery (GATB), and the Armed Services Vocational Aptitude Battery (ASVAB). These batteries either used factor analysis directly for the delineation of useful subtests or were guided in their construction by the accumulated results of other factor-analytic research. The salient characteristics of each battery are briefly reviewed in the following sections.
The Differential Aptitude Test (DAT)
The DAT was first issued in 1947 to provide a basis for the educational and vocational guidance of students in grades 7 through 12. Subsequently, examiners have found the test useful in the vocational counseling of young adults out of school and in the selection of employees. Now in its fifth edition (1992), the test has been periodically revised and stands as one of the most popular multiple aptitude test batteries of all time (Bennett, Seashore, & Wesman, 1982 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib134) , 1984 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib135) ). Wang (1995 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1714) ) provides a succinct overview of the test.
The DAT consists of eight independent tests:
Verbal Reasoning (VR)
Numerical Reasoning (NR)
Abstract Reasoning (AR)
Perceptual Speed and Accuracy (PSA)
Mechanical Reasoning (MR)
Space Relations (SR)
Spelling (S)
Language Usage (LU)
A characteristic item from each test is shown in Figure 6.4 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/ch06lev1sec3#ch06fig4) .
The authors chose the areas for the eight tests based on experimental and experiential data rather than relying on a formal factor analysis of their own.
FIGURE 6.4 Differential Aptitude Tests and Characteristic Items
In constructing the DAT, the authors were guided by several explicit criteria:
Each test should be an independent test: There are situations in which only part of the battery is required or desired.
The tests should measure power: For most vocational purposes to which test results contribute, the evaluation of power (solving difficult problems with adequate time) is of primary concern.
The test battery should yield a profile: The eight separate scores can be converted to percentile ranks and plotted on a common profile chart.
The norms should be adequate: In the fifth edition, the norms are derived from 100,000 students for the fall standardization, 70,000 for the spring standardization.
The test materials should be practical: With time limits of 6 to 30 minutes per test, the entire DAT can be administered in a morning or an afternoon school session.
The tests should be easy to administer: Each test contains excellent “warm-up” examples and can be administered by persons with a minimum of special training.
Alternate forms should be available: For purposes of retesting, the availability of alternate forms (currently forms C and D) will reduce any practice effects.
The reliability of the DAT is generally quite high, with split-half coefficients largely in the .90s and alternate-forms reliabilities ranging from .73 to .90, with a median of .83. Mechanical Reasoning is an exception, with reliabilities as low as .70 for girls. The tests show a mixed pattern of intercorrelations with each other, which is optimistically interpreted by the authors as establishing the independence of the eight tests. Actually, many of the correlations are quite high and it seems likely that the eight tests reflect a smaller number of ability factors. Certainly, the Verbal Reasoning and Numerical Reasoning tests measure a healthy general factor, with correlations around .70 in various samples.
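For readers unfamiliar with how a split-half coefficient is obtained, the following Python sketch shows one common approach: correlate scores on the odd and even items, then apply the Spearman-Brown correction to estimate the reliability of the full-length test. The text does not say how the DAT coefficients were actually computed, so the odd-even split, the function names, and the response data are all illustrative assumptions.

```python
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman_brown(r_half):
    """Step a half-test correlation up to full-test length."""
    return 2 * r_half / (1 + r_half)


def split_half_reliability(item_scores):
    """item_scores: one row per examinee, one 0/1 entry per item.

    Assumption: an odd-even split of the items; other splits are possible.
    """
    odd = [sum(row[0::2]) for row in item_scores]
    even = [sum(row[1::2]) for row in item_scores]
    return spearman_brown(pearson_r(odd, even))


if __name__ == "__main__":
    # Hypothetical responses of five examinees to six items (1 = correct).
    responses = [
        [1, 1, 1, 0, 1, 1],
        [1, 0, 0, 0, 1, 0],
        [1, 1, 1, 1, 1, 1],
        [0, 0, 1, 0, 0, 0],
        [1, 1, 0, 1, 0, 1],
    ]
    print(round(split_half_reliability(responses), 2))
```

The correction matters because each half is only half as long as the full test, and shorter tests are less reliable; the uncorrected half-test correlation would understate the reliability of the complete instrument.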
The manual presents extensive data demonstrating that the DAT tests, especially the VR + NR combination, are good predictors of other criteria such as school grades and scores on other aptitude tests (correlations in the .60s and .70s). For this reason, the combination of VR + NR often is considered an index of scholastic aptitude. Evidence for the differential validity of the other tests is rather slim. Bennett, Seashore, and Wesman (1974 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib133) ) do present results of several follow-up studies correlating vocational entry/success with DAT profiles, but their research methods are more impressionistic than quantitative; the independent observer will find it difficult to make use of their results. Schmitt (1995 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1468) ) notes that a major problem with the battery is the
lack of discriminant validity between the eight subtests. With the exception of the Perceptual Speed and Accuracy test, all of the subscales are highly intercorrelated (.50 to .75). If one wants only a general index of the person’s academic ability, this is fine; if the scores on the subtests are to be used in some diagnostic sense, this level of intercorrelation makes statements about students’ relative strengths and weaknesses highly questionable.
Even so, the revised DAT is better than previous editions. One significant improvement is the elimination of apparent sex bias on the Language Usage and Mechanical Reasoning tests, a source of criticism from earlier reviews. The DAT has been translated into several languages and is widely used in Europe for vocational guidance and research applications (e.g., Nijenhuis, Evers, & Mur, 2000 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1237) ; Colom, Quiroga, & Juan-Espinosa, 1999 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib326) ).
A computerized version of the DAT has been available for several years, although its equivalence to the traditional paper and pencil format cannot be taken for granted (Alkhadher, Clarke, & Anderson, 1998 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib17) ). We will have more to say about computerized testing in a later section of the book. For now, it will suffice to mention that the psychometric qualities of a test may shift when the mode of administration is changed. Using counterbalanced testing in which examinees completed both versions (half taking the traditional version first, half taking the computerized version first), Alkhadher et al. (1998 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib17) ) found that oil refinery trainees (N = 122) scored higher on one subtest of the computerized version than on the traditional version of the DAT, namely, the Numerical Ability subtest. The researchers conjectured that the computerized version reduced test fatigue, alleviated time pressure, and also provided novelty—thus boosting test performance modestly.
The General Aptitude Test Battery (GATB)
In the late 1930s, the U.S. Department of Labor developed aptitude tests to predict job performance in 100 specific occupations. In the 1940s, the department hired a panel of experts in measurement and industrial-organizational psychology to create a multiple aptitude test battery to assess the 100 occupations previously studied and many more. The outcome of this Herculean effort was the General Aptitude Test Battery (GATB), widely acknowledged as the premier test battery for predicting job performance (Hunter, 1994 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib801) ).
The GATB was derived from a factor analysis of 59 tests administered to thousands of male trainees in vocational courses (United States Employment Service, 1970 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1679) ). The interpretive standards have been periodically revised and updated, so the GATB is a thoroughly modern instrument even though its content is little changed. One limitation is that the battery is available mainly to state employment offices, although nonprofit organizations, including high schools and certain colleges, can make special arrangements for its use.
The GATB is composed of eight paper-and-pencil tests and four apparatus measures. The entire battery can be administered in approximately two-and-a-half hours and is appropriate for high school seniors and adults. The 12 tests yield a total of nine factor scores:
General Learning Ability (intelligence) (G). This score is a composite of Vocabulary, Arithmetic Reasoning, and Three-Dimensional Space.
Verbal Aptitude (V). Derived from a Vocabulary test that requires the examinee to indicate which two words in a set are either synonyms or antonyms.
Numerical Aptitude (N). This score is a composite of both the Computation and Arithmetic Reasoning tests.
Spatial Aptitude (S). Consists of the Three-Dimensional Space test, a measure of the ability to perceive two-dimensional representations of three-dimensional objects and to visualize movement in three dimensions.
Form Perception (P). This score is a composite of Form Matching and Tool Matching, two tests in which the examinee must match identical drawings.
Clerical Perception (Q). A proofreading test called Name Comparison, in which the examinee must match names under pressure of time.
Motor Coordination (K). Measures the ability to quickly make specified pencil marks in the Mark Making test.
Finger Dexterity (F). A composite of the Assemble and Disassemble tests, two measures of dexterity with rivets and washers.
Manual Dexterity (M). A composite of Place and Turn, two tests requiring the examinee to transfer and reverse pegs in a board.
The nine factor scores on the GATB are expressed as standard scores with a mean of 100 and an SD of 20. These standard scores are anchored to the original normative sample of 4,000 workers obtained in the 1940s. Alternate-forms reliability coefficients for factor scores range from the .80s to the .90s. The GATB manual summarizes several studies of the validity of the test, primarily in terms of its correlation with relevant criterion measures. Hunter (1994 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib801) ) notes that GATB scores predict training success for all levels of job complexity. The average validity coefficient is a phenomenal .62.
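To make the standard-score scale concrete, the short Python sketch below converts a raw score to a scale with a mean of 100 and an SD of 20 by way of a z score. The normative mean and SD passed in are hypothetical values chosen for illustration, and the percentile step assumes normally distributed scores; the operational GATB conversion tables are not described in the text and may differ.

```python
from statistics import NormalDist

SCALE_MEAN = 100.0  # mean of the GATB standard-score scale (from the text)
SCALE_SD = 20.0     # SD of the GATB standard-score scale (from the text)


def to_standard_score(raw, norm_mean, norm_sd):
    """Linear conversion: raw score -> z score -> standard score.

    norm_mean and norm_sd describe the raw scores of the normative
    sample; the values used below are hypothetical.
    """
    z = (raw - norm_mean) / norm_sd
    return SCALE_MEAN + SCALE_SD * z


def percentile_if_normal(standard_score):
    """Proportion of the norm group scoring below this standard score,
    assuming (for illustration only) a normal distribution."""
    return NormalDist(mu=SCALE_MEAN, sigma=SCALE_SD).cdf(standard_score)


if __name__ == "__main__":
    # Hypothetical examinee: raw score of 60 where the norm group
    # averaged 50 with an SD of 10, i.e., one SD above the mean.
    score = to_standard_score(60, norm_mean=50, norm_sd=10)
    print(score, round(percentile_if_normal(score), 2))
```

Because every factor score shares the same scale, an examinee's nine scores can be compared directly with one another and with the score patterns reported for different occupations.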
The absolute scores are of less interest than their comparison to updated Occupational Aptitude Patterns (OAPs) for dozens of occupations. Based on test results for huge samples of applicants and employees in different occupations, counselors and employers now have access to a wealth of information about score patterns needed for success in a variety of jobs. Thus, one way of using the GATB is to compare an examinee’s scores with OAPs believed necessary for proficiency in various occupations.
Hunter (1994 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib801) ) recommends an alternative strategy based on composite aptitudes (Figure 6.5 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/ch06lev1sec3#ch06fig5) ). The nine specific factor scores combine nicely into three general factors: Cognitive, Perceptual, and Psychomotor. Hunter notes that different jobs require various contributions of the Cognitive, Perceptual, and Psychomotor aptitudes. For example, an assembly line worker in an automotive plant might need high scores on the Psychomotor and Perceptual composites, whereas the Cognitive score would be less important for this occupation. Hunter’s research demonstrates that general factors dominate over specific factors in the prediction of job performance. Davison, Gasser, and Ding (1996 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib396) ) discuss additional approaches to GATB profile analysis and interpretation.
Van de Vijver and Harsveld (1994 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1685) ) investigated the equivalence of their computerized version of the GATB with the traditional paper-and-pencil version. Of course, only the cognitive and perceptual subtests were compared—tests of motor skills cannot be computerized. They found that the two versions were not equivalent. In particular, the computerized subtests produced faster and more inaccurate responses than the conventional subtests. Their research demonstrates once again that the equivalence of traditional and computerized versions of a test should not be assumed. This is an empirical question answerable only with careful research. Nijenhuis and van der Flier (1997 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1236) ) discuss a Dutch version of the GATB and its application in the study of cognitive differences between immigrants and majority group members in the Netherlands.
FIGURE 6.5 Specific and General Factors on the GATB
The Armed Services Vocational Aptitude Battery (ASVAB)
The ASVAB is probably the most widely used aptitude test in existence. This instrument is used by the Armed Services to screen potential recruits and to assign personnel to different jobs and training programs. The ASVAB is also available in a computerized version that is rapidly supplanting the original paper-and-pencil test (Segall & Moreno, 1999). The computerized ASVAB is discussed in more detail at the end of this section. More than 2 million examinees take the ASVAB each year. The current version consists of nine subtests, four of which produce the Armed Forces Qualification Test (AFQT), the common qualifying exam for all services (Table 6.1 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/ch06lev1sec3#ch06tab1) ). Alternate-forms reliability coefficients for ASVAB scores are in the mid-.80s to mid-.90s, and test–retest coefficients range from the mid-.70s to the mid-.80s (Larson, 1994 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib952) ). The one exception is Paragraph Comprehension with a reliability of only .50. The test is well normed on a representative sample of 12,000 persons between the ages of 16 and 23 years. The ASVAB manual reports a median validity coefficient of .60 with measures of training performance.
Decisions about ASVAB examinees are typically based on composite scores, not subtest scores. For example, an Electronics Composite is derived by combining Arithmetic Reasoning, Mathematics Knowledge, Electronics Information, and General Science. Persons scoring well on this composite might be assigned to electronics-related positions. Since the composite scores are empirically derived, new ones can be developed for placement decisions at any time. Composite scores are continually updated and revised.
At one point, the Armed Services relied heavily on the seven composites in the following list (Murphy, 1984 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1187) ). The Coding Speed subtest, listed here, is no longer used. The first three constitute academic composites, whereas the remaining are occupational composites. The reader will notice that individual subtests may appear in more than one composite:
Academic Ability: Word Knowledge, Paragraph Comprehension, and Arithmetic Reasoning
Verbal: Word Knowledge, Paragraph Comprehension, and General Science
Math: Mathematics Knowledge and Arithmetic Reasoning
Mechanical and Crafts: Arithmetic Reasoning, Mechanical Comprehension, Auto and Shop Information, and Electronics Information
Business and Clerical: Word Knowledge, Paragraph Comprehension, Mathematics Knowledge, and Coding Speed
Electronics and Electrical: Arithmetic Reasoning, Mathematics Knowledge, Electronics Information, and General Science
Health, Social, and Technology: Word Knowledge, Paragraph Comprehension, Arithmetic Reasoning, and Mechanical Comprehension
TABLE 6.1 The Armed Services Vocational Aptitude Battery (ASVAB) Subtests
Arithmetic Reasoning* (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/ch06lev1sec3#ch06fn03): 16-item test of arithmetic word problems based on simple calculation
Mathematics Knowledge* (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/ch06lev1sec3#ch06fn03): 25-item test of algebra, geometry, fractions, decimals, and exponents
Word Knowledge* (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/ch06lev1sec3#ch06fn03): 35-item test of vocabulary knowledge and synonyms
Paragraph Comprehension* (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/ch06lev1sec3#ch06fn03): 15-item test of reading comprehension in short paragraphs
General Science: 25-item test of general knowledge in physical and biological science
Mechanical Comprehension: 25-item test of mechanical and physical principles
Electronics Information: 20-item test of electronics, radio, and electrical principles
Assembling Objects: 16-item test of mechanical and assembly concepts
Auto and Shop: 25-item test of basic knowledge of autos, shop practices, and tool usage
*Subtests that make up the Armed Forces Qualification Test (AFQT).
The problem with forming composites in this manner is that they are so highly correlated with one another as to be essentially redundant. In fact, the average intercorrelation among these seven composite scores is .86 (Murphy, 1984 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1187) )! Clearly, composites do not always provide differential information about specific aptitudes. Perhaps that is why recent editions of the ASVAB have steered clear of multiple, complex composites. Instead, the emphasis is on simpler composites that are composed of highly related constructs. For example, a Verbal Ability composite is derived from Word Knowledge and Paragraph Comprehension, two highly interrelated subtests. In like manner, a Math Ability composite is obtained from the combination of Arithmetic Reasoning and Mathematics Knowledge.
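A small calculation helps show why composites built this way end up so highly correlated. The Python sketch below computes the correlation between two unit-weighted composites from their subtest membership, under the simplifying assumption that every pair of distinct subtests correlates at the same value. That uniform correlation (.60 here) is a hypothetical figure, not a value from the ASVAB manual, and the sketch is not intended to reproduce the .86 average reported by Murphy (1984).

```python
def composite_correlation(subtests_a, subtests_b, rho):
    """Correlation between two unit-weighted sums of standardized subtests.

    Assumption: every pair of distinct subtests correlates at rho,
    and a subtest correlates 1.0 with itself.
    """
    def covariance(s, t):
        return sum(1.0 if i == j else rho for i in s for j in t)

    var_a = covariance(subtests_a, subtests_a)
    var_b = covariance(subtests_b, subtests_b)
    return covariance(subtests_a, subtests_b) / (var_a * var_b) ** 0.5


if __name__ == "__main__":
    # Composite membership taken from the list in the text.
    academic = ["Word Knowledge", "Paragraph Comprehension", "Arithmetic Reasoning"]
    verbal = ["Word Knowledge", "Paragraph Comprehension", "General Science"]
    math_comp = ["Mathematics Knowledge", "Arithmetic Reasoning"]

    rho = 0.60  # hypothetical uniform subtest intercorrelation
    print(round(composite_correlation(academic, verbal, rho), 2))
    print(round(composite_correlation(math_comp, verbal, rho), 2))
```

Two things drive the result: shared subtests contribute perfect correlations to the numerator, and summing several positively correlated subtests averages out their specific variance, so even composites with no subtests in common correlate more highly than their individual subtests do.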
Some researchers have concluded that the ASVAB does not function as a multiple aptitude test battery but achieves success in predicting diverse vocational assignments because the composites invariably tap a general factor of intelligence. For example, Dunai and Porter (2001 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib441) ) report favorably on the ASVAB as a predictor of entry-level success of radiography students in Air Force medical training. The ASVAB may be a good test of general intelligence, but it falls short as a multiple aptitude test battery. Another concern is that the test may possess different psychometric structures for men and women. Specifically, the Electronics Information subtest is a good measure of g (the general factor of intelligence) for men but not women (Ree & Carretta, 1995). The likely explanation for this is that men are about nine times more likely to enroll in high school classes in electronics and auto shop, and men, therefore, have the opportunity for their general ability to shape what they learn about electronics information, whereas women do not. Scores on this subtest will, therefore, function as a measure of achievement (what has already been learned) but not as an index of aptitude (forecasting future results).
Research on a computerized adaptive testing (CAT) version of the ASVAB has been under way since the 1980s. Computerized adaptive testing is discussed in Topic 12B (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/ch12lev1sec5#ch12box3) , Computerized Assessment and the Future of Testing. We provide a brief overview here. In CAT, the examinee takes the test while sitting at a computer terminal. The difficulty level of the items presented on the screen is continually readjusted as a function of the examinee’s ongoing performance. In general, an examinee who answers a subtest item correctly will receive a harder item, whereas an examinee who fails that item will receive an easier item. The computer uses item response theory as a basis for selecting items. Each examinee receives a unique set of test items tailored to his or her ability level.
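The adaptive logic just described can be sketched in a few lines of Python. The sketch assumes a one-parameter (Rasch) item response model, picks the unused item that is most informative at the current ability estimate, and nudges the estimate up or down with a shrinking step after each response. The item bank, the step rule, and all names are hypothetical; operational programs such as the CAT-ASVAB use more sophisticated item selection and ability estimation than this.

```python
import math


def p_correct(theta, difficulty):
    """Rasch model: probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))


def item_information(theta, difficulty):
    """Item information under the Rasch model; largest when the
    item's difficulty matches the examinee's ability."""
    p = p_correct(theta, difficulty)
    return p * (1.0 - p)


def run_cat(bank, answer, n_items, theta=0.0, step=1.0):
    """Administer n_items adaptively.

    bank:   dict mapping item name -> difficulty (hypothetical values)
    answer: callable(item_name, difficulty) -> True if answered correctly
    Returns the final ability estimate and the administered items.
    """
    remaining = dict(bank)
    administered = []
    for k in range(n_items):
        # Select the unused item that is most informative at theta.
        item = max(remaining, key=lambda name: item_information(theta, remaining[name]))
        difficulty = remaining.pop(item)
        correct = answer(item, difficulty)
        # Simplified update (an assumption): move up after a correct
        # answer, down after an error, with a step that shrinks over time.
        theta += (step / (k + 1)) * (1 if correct else -1)
        administered.append((item, correct))
    return theta, administered


if __name__ == "__main__":
    bank = {"easy": -1.0, "medium": 0.0, "hard": 1.0, "harder": 2.0}
    # Hypothetical examinee who answers every item correctly.
    theta, items = run_cat(bank, lambda name, b: True, n_items=3)
    print(round(theta, 2), [name for name, _ in items])
```

An examinee who keeps answering correctly is routed to progressively harder items, while one who keeps missing items is routed to easier ones, which is why two examinees rarely see the same set of items.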
In 1990, the CAT-ASVAB began to replace the paper-and-pencil ASVAB. Currently, more than two-thirds of all military applicants are tested with the computerized version. Larson (1994 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib952) ) lists the reasons for adopting the CAT-ASVAB as follows:
Shorten overall testing time (adaptive tests require roughly one-half the items of standard tests).
Increase test security by eliminating the possibility that test booklets could be stolen.
Increase test precision at the upper and lower ability extremes.
Provide a means for immediate feedback on test scores, since the computers used for testing can immediately score the tests and output the results.
Provide a means for flexible test start times (unlike group-administered paper-and-pencil tests, for which everyone must start and stop at the same time, computer-based testing can be tailored to the examinees’ personal schedules) (Larson, 1994 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib952) ).
Reliability and validity studies of the CAT-ASVAB provide strong support for its equivalence to the original test. In general, the computerized version of the instrument measures the same constructs as its paper-and-pencil counterpart, and does so in less time and with greater precision (Moreno & Segall, 1997 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1174) ). With the success of this project, the CAT-ASVAB and other tests likely will be expanded to measure new aspects of performance such as response latencies and to display unique item types such as visuospatial tests of objects in motion (Larson, 1994 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib952) ). The CAT-ASVAB has the potential to change the future of testing.
6.4 PREDICTING COLLEGE PERFORMANCE
As almost every college student knows, a major use of aptitude tests is the prediction of academic performance. In most cases, applicants to college must contend with the Scholastic Assessment Test (SAT) or the American College Test (ACT) assessment program. Institutions may set minimum standards on the SAT or ACT tests for admission, based on the knowledge that low scores foretell college failure. In this section we will explore the technical adequacy and predictive validity of the major college aptitude tests.
The Scholastic Assessment Test (SAT)
Formerly known as the Scholastic Aptitude Test, the Scholastic Assessment Test, or SAT, is the oldest of the college admissions tests, dating back to 1926. The SAT is published by the College Board (formerly the College Entrance Examination Board), a group formed in 1899 to provide a national clearinghouse for admissions testing. As noted by historian Fuess (1950 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib549) ), the purpose of a nationally based admissions test was “to introduce law and order into an educational anarchy which towards the close of the nineteenth century had become exasperating, indeed almost intolerable, to schoolmasters.” Over the years, the test has been extensively revised, continuously updated, and repeatedly renormed. In the early 1990s, the SAT was renamed the Scholastic Assessment Test to emphasize changes in content and format. The new SAT assesses mastery of high school subject matter to a greater extent than its predecessor but continues to tap reasoning skills. The SAT represents the state of the art in aptitude testing.
The new SAT, released in 2005, consists of the SAT Reasoning Test and the SAT Subject Tests. The SAT Reasoning Test is used for college admission decisions, whereas the optional SAT Subject Tests typically are needed for advanced college placement in fields such as Biology, Chemistry, History, Foreign Languages, and Mathematics. We restrict our discussion here to the SAT Reasoning Test. For ease of discussion, we refer to it simply as the “SAT.”
The SAT consists of three sections, each containing three or four subtests (Table 6.2 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/ch06lev1sec4#ch06tab2) ). The Critical Reading section involves reading individual paragraphs and then answering multiple-choice questions about the passages. The questions embody three approaches:
TABLE 6.2 Sections and Subtests of the SAT Reasoning Test
Section | Subtests
Critical Reading | Extended Reasoning; Literal Comprehension; Vocabulary in Context
Math | Numbers and Operations; Algebra and Functions; Geometry and Measurement; Data Analysis, Statistics, and Probability
Writing | Essay; Improving Sentences; Identifying Sentence Errors; Improving Paragraphs
Vocabulary in Context: discerning the meaning of words from their context in the passage
Literal Comprehension: understanding significant information directly available in the passage
Extended Reasoning: following an argument or making inferences from the passage
Some questions in the Critical Reading section also use a complex fill-in-the-blank format. However, instead of testing for mere factual knowledge, the questions evaluate verbal comprehension. Here is a straightforward example:
Hoping to ________ the dispute, the family therapist proposed a concession that he felt would be ________ to both mother and daughter.
A. end . . . divisive
B. overcome . . . unappealing
C. protract . . . satisfactory
D. resolve . . . acceptable
E. enforce . . . useful
The correct answer is D. Of course, the SAT incorporates more difficult items of this genre.
The second part of the SAT is the Math section, consisting of four subtests. Collectively, these subtests assess basic math skills in algebra, geometry, statistics, and data analysis needed for successful navigation of college. Most of the questions are in multiple-choice format, for example:
A special lottery was announced to select the student who will live in the only luxury apartment in student housing. In all, 50 juniors, 125 sophomores, and 175 freshmen applied. However, juniors were allowed to purchase 4 tickets each. What is the probability that the room will be awarded to a junior?
A. 1/5
B. 1/2
C. 2/5
D. 1/7
E. 2/7
The correct answer is C. In addition to multiple-choice questions, the Math section includes several items that require the student to generate a single correct answer and then enter it on the response sheet. For example:
What value of x satisfies both equations below?
x² − 4 = 0
|4x + 6| = 2
The correct answer is −2. Strategies for finding a solution that might work with a multiple-choice question, such as trial and error or process of elimination, are not likely to help with this style of question. Here the examinee must generate the correct answer by dint of careful analysis.
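Both sample Math items above can be verified by direct computation. The sketch below is only an illustrative check, not part of the SAT materials. It assumes, as the keyed answer implies, that each sophomore and freshman holds exactly one ticket, and the function names are hypothetical.

```python
from fractions import Fraction


def junior_win_probability(juniors=50, sophomores=125, freshmen=175,
                           tickets_per_junior=4):
    """Probability that a randomly drawn ticket belongs to a junior.

    Assumes every non-junior applicant holds exactly one ticket.
    """
    junior_tickets = juniors * tickets_per_junior
    total_tickets = junior_tickets + sophomores + freshmen
    return Fraction(junior_tickets, total_tickets)


def common_solutions(candidates=range(-10, 11)):
    """Integers that satisfy both x^2 - 4 = 0 and |4x + 6| = 2."""
    return [x for x in candidates
            if x * x - 4 == 0 and abs(4 * x + 6) == 2]


print(junior_win_probability())  # 2/5 (200 of 500 tickets)
print(common_solutions())        # [-2]
```

The second check also shows why x = 2 fails: it satisfies the first equation but gives |4(2) + 6| = 14 in the second.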
The Writing portion of the SAT now consists of a 25-minute Essay section and three multiple-choice subtests that evaluate the ability of the examinee to improve sentences, identify sentence errors, and improve paragraphs. In the Essay test, the examinee reads a short excerpt and then writes a short paper that takes a point of view. Here is an example of an excerpt and assignment:
A sense of happiness and fulfillment, not personal gain, is the best motivation and reward for one’s achievements. Expecting a reward of wealth or recognition for achieving a goal can lead to disappointment and frustration. If we want to be happy in what we do in life, we should not seek achievement for the sake of winning wealth and fame. The personal satisfaction of a job well done is its own reward.
Assignment: Are people motivated to achieve by personal satisfaction rather than by money or fame? Plan and write an essay in which you develop your point of view on this issue. Support your position with reasoning and examples taken from your reading, studies, experience, or observations. (College Board, 2005 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib324) )
The essay is evaluated by two trained readers on a 1 to 6 scale, resulting in a total score of 2 to 12 for the Essay test. Students also receive a separate score on a scale from 20 to 80 for the multiple-choice portion of the Writing section. Both these scores are combined for the overall section score for Writing. SAT scores for each of the three sections—Critical Reading, Math, and Writing—are now reported on the familiar 200- to 800-point scale, with an approximate mean of 500 and standard deviation of 100.
Great care is taken in the construction of new forms of the SAT because unfailing reliability and a high degree of parallelism are essential to the mission of this testing program. Historically, the internal consistency reliability of all sections has consistently fallen in the range of .91 to .93; with only a few exceptions, test–retest correlations vary between .87 and .89. The standard error of measurement is 30 to 35 points.
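The reported standard error of measurement is broadly consistent with these reliability figures under classical test theory, where SEM = SD × sqrt(1 − r). The sketch below assumes that formula and the nominal section standard deviation of 100. The source does not state which reliability estimate the 30 to 35 point figure rests on, and the function name is illustrative.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """Classical test theory SEM: sd * sqrt(1 - reliability)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)


# Reliability values spanning the test-retest and internal consistency ranges.
for r in (0.87, 0.89, 0.91, 0.93):
    sem = standard_error_of_measurement(100, r)
    print(f"reliability {r:.2f} -> SEM of about {sem:.1f} points")
```

Under these assumptions, reliabilities of roughly .88 to .91 correspond to an SEM of about 30 to 35 points on the 200- to 800-point scale.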
Frey and Detterman (2004 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib544) ) conducted a sophisticated factor analytic study of the relationship between the SAT and g or general intelligence. Results for 917 youth who took the SAT and the ASVAB indicated a correlation of .82 between g (as extracted from ASVAB results) and SAT scores. They concluded that the SAT is an excellent measure of general cognitive ability.
The primary evidence for SAT validity is criterion-related, in this case, the ability to predict first-year college grades. Donlon (1984 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib431) , chap. VIII) reports a wealth of information on this point for earlier editions; we can only summarize trends here. In 685 studies, the combined SAT Verbal and Math scores correlated .42, on average, with college first-year grade point average. Interestingly, high school record (e.g., rank or grade point average) fares better than the SAT in predicting college grades (r = .48). But the combination of SAT and high school record proves even more predictive; these variables correlated .55, on average, with college first-year grade point average. Of course, these findings reflect a substantial restriction of range: low SAT-scoring high school students tend not to attend college. Donlon (1984 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib431) ) estimated that the real correlation without restriction of range (SAT + high school record) would be in the neighborhood of
.65. According to the College Board website, the combination of SAT and high school GPA continues to provide a robust correlation (r = .62) with freshman grades. Based on a sample of 151,316 students attending 110 colleges and universities across the United States, these results leave no room for doubt as to the general predictive power of SAT scores (www.collegeboard.com (http://www.collegeboard.com) ). However, the results also show that for students whose best language is not English (e.g., children of recent immigrants), the crucial reading and writing portions of the SAT underpredict freshman grades.
The American College Test (ACT)
The American College Test (ACT) assessment program is a recent program of testing and reporting designed for college-bound students. In addition to traditional test scores, the ACT assessment program includes a brief 90-item interest inventory (based on Holland’s typology) and a student profile section (in which the student may list subjects studied, notable accomplishments, work experience, and community service). We will not discuss these ancillary measures here, except to note that they are useful in generating the Student Profile Report, which is sent to the examinee and the colleges listed on the registration folder.
Initiated in 1959, the ACT is based on the philosophy that direct tests of the skills needed in college courses provide the most efficient basis for predicting college performance. In terms of the number of students who take it, the ACT occupies second place behind the SAT as a college admissions test. The four ACT tests require knowledge of a subject area, but emphasize the use of that knowledge:
English (75 questions, 45 minutes). The examinee is presented with several prose passages excerpted from published writings. Certain portions of the text are underlined and numbered, and possible revisions for the underlined sections are presented; in addition, “no change” is one choice. The examinee must choose the best option.
Mathematics (60 questions, 60 minutes). Here the examinee is asked to solve the kinds of mathematics problems likely to be encountered in basic college mathematics courses. The test emphasizes concepts rather than formulas and uses a multiple-choice format.
Reading (40 questions, 35 minutes). This subtest is designed to assess the examinee’s level of reading comprehension; subscores are reported for social studies/sciences and arts/literature reading skills.
Science Reasoning (40 questions, 35 minutes). This test assesses the ability to read and understand material in the natural sciences. The questions are drawn from data representations, research summaries, and conflicting viewpoints.
In addition to the area scores listed previously, ACT results are also reported as an overall Composite score, which is the average of the four tests. ACT scores are reported on a standard score 36-point scale. In 2012, the average ACT Composite score of high school graduates was 21.1, with a standard deviation of about 5 points.
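As a concrete illustration of the Composite score, the sketch below averages four test scores. The rounding rule (halves round up to the next whole number) and the example scores are assumptions made for illustration; the text above states only that the Composite is the average of the four tests. The z-score helper uses the 2012 mean of 21.1 and the approximate standard deviation of 5 quoted above.

```python
import math


def act_composite(english, mathematics, reading, science):
    """Average of the four ACT test scores, rounded half up (assumed rule)."""
    scores = (english, mathematics, reading, science)
    if any(not 1 <= s <= 36 for s in scores):
        raise ValueError("each ACT test score must lie on the 1-36 scale")
    return math.floor(sum(scores) / 4 + 0.5)


def z_score(composite, mean=21.1, sd=5.0):
    """Distance from the 2012 mean in standard deviation units."""
    return (composite - mean) / sd


composite = act_composite(24, 27, 22, 25)  # hypothetical examinee
print(composite, round(z_score(composite), 2))
```

For this hypothetical examinee the four scores average 24.5, which the assumed rule rounds to a Composite of 25, a little under 0.8 standard deviations above the 2012 mean.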
Critics of the ACT program have pointed to the heavy emphasis on reading comprehension that saturates all four tests. The average intercorrelation of the tests is typically around .60. These data suggest that a general achievement/ability factor pervades all four tests; results for any one test should not be overinterpreted. Fortunately, college admission officers probably place the greatest emphasis on the Composite score, which is the average of the four separate tests. The ACT test appears to measure much the same thing as the SAT; the correlation between these two tests approaches .90. It is not surprising, then, that the predictive validity of the ACT Composite score rivals the SAT combined score, with correlations in the vicinity of .40 to .50 with college first-year grade point average. The predictive validity coefficients are virtually identical for advantaged and disadvantaged students, indicating that the ACT tests are not biased.
Kifer (1985 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib887) ) does not question the technical adequacy of the ACT and similar testing programs but does protest the enormous symbolic power these tests have accrued. The heavy emphasis on test scores for college admissions is not a technical issue, but a social, moral, and political concern:
Selective admissions means simply that an institution cannot or will not admit each person who completes an application. Choices of who will or will not be admitted should be, first of all, a matter of what the institution believes is desirable and may or may not include the use of prediction equations. It is just as defensible to select on talent broadly construed as it is to use test scores however high. There are talented students in many areas—leaders, organizers, doers, musicians, athletes, science award winners, opera buffs—who may have moderate or low ACT scores but whose presence on a campus would change it.
The reader may wish to review Topic 6B (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/ch06lev1sec6#ch06lev2sec21) , Test Bias and Other Controversies, for further discussion of this point.
6.5 POSTGRADUATE SELECTION TESTS
Graduate and professional programs also rely heavily on aptitude tests for admission decisions. Of course, many other factors are considered when selecting students for advanced training, but there is no denying the centrality of aptitude test results in the selection decision. For example, Figure 6.6 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/ch06lev1sec5#ch06fig6) depicts a fairly typical quantitative weighting system used in evaluating applicants for graduate training in psychology. The reader will notice that an overall score on the Graduate Record Exam (GRE) receives the single highest weighting in the selection process. We review the GRE in the following sections, as well as admission tests used by medical schools and law schools.
FIGURE 6.6 Representative Weighting Scheme Used by Graduate Program Admission Committees in Psychology
Graduate Record Exam (GRE)
The GRE is a multiple-choice and essay test widely used by graduate programs in many fields as one component in the selection of candidates for advanced training. The GRE offers subject examinations in many fields (e.g., Biology, Computer Science, History, Mathematics, Political Science, Psychology), but the heart of the test is the general test designed to measure verbal, quantitative, and analytical writing aptitudes. The verbal section (GRE-V) includes verbal items such as analogies, sentence completion, antonyms, and reading comprehension. The quantitative section (GRE-Q) consists of problems in algebra, geometry, reasoning, and the interpretation of data, graphs, and diagrams. The analytical writing section (GRE-AW) was added in October 2002 as a measure of higher-level critical thinking and analytical writing skills. It consists of two writing tasks: a 30-minute essay in which the applicant analyzes an issue, and a 30-minute essay in which the applicant analyzes an argument. Here is an example of an issue question:
As people rely more and more on technology to solve problems, the ability of humans to think for themselves will surely deteriorate.
Discuss the extent to which you agree or disagree with the statement and explain your reasoning for the position you take. In developing and supporting your position, you should consider ways in which
the statement might or might not hold true and explain how these considerations shape your position. (www.ets.org/gre (http://www.ets.org/gre) ).
The argument questions entail reading a short paragraph that presents an argument and then writing a critique of that argument.
Beginning in 2012, the first two scores (GRE-V and GRE-Q) were reported as standard scores with a mean of about 150 and a range of 130 to 170. This new scaling metric represents a substantial change from the familiar GRE scale employed since the 1950s. Prior to 2012, the first two scores (GRE-V and GRE-Q) were reported as standard scores with a mean of about 500 and standard deviation of 100 (range of 200 to 800). Actually, the mean scores shifted from year to year because all test results were anchored to a standard reference group of 2,095 college seniors tested in 1952 on the verbal and quantitative portions of the test. Historically, graduate programs have paid more attention to the first two parts of the test (GRE-V and GRE-Q). Recently, programs have acknowledged the importance of writing skills in their applications, which explains the addition of the analytical writing section (GRE-AW).
Scoring of the analytical writing section is based on 6-point holistic ratings provided independently by two trained raters. If the two scores differ by more than one point on the scale, the discrepancy is adjudicated by a third GRE-AW reader. According to the GRE Board (www.gre.org (http://www.gre.org) ), the GRE-AW test reveals smaller ethnic group differences than found in the multiple-choice sections. For example, the differences between African American and Caucasian examinees and between Hispanic and Caucasian examinees are smaller on the GRE-AW than on the GRE-V or GRE-Q. This suggests that the new test does not unduly penalize ethnic groups traditionally underrepresented in graduate programs.
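The adjudication rule described above amounts to a simple discrepancy check. The sketch below is a hypothetical illustration: the function name is invented, and because the text does not say how the final reported score is derived from the ratings, only the trigger for a third reading is modeled.

```python
def needs_third_reader(rating_a, rating_b, max_gap=1):
    """True when two holistic ratings differ by more than max_gap points."""
    return abs(rating_a - rating_b) > max_gap


print(needs_third_reader(4, 5))  # False: ratings are within one point
print(needs_third_reader(3, 5))  # True: a two-point gap goes to a third reader
```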
The reliability of the GRE is strong, with internal consistency reliability coefficients typically around .90 for the three components. The validity of the GRE commonly has been examined in relation to the ability of the test to predict performance in graduate school. Performance has been operationalized mainly as grade point average, although faculty ratings of student aptitude also have been used. For example, based on a meta-analytic review of 22 studies with a total of 5,186 students, Morrison and Morrison (1995 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1179) ) concluded that GRE-V correlated .28 and GRE-Q correlated .22 with graduate grade point average. Thus, on average, GRE scores accounted for only 6.3 percent of the variance in graduate-level academic performance. In a recent study of 170 graduate students in psychology at Yale University, Sternberg and Williams (1997 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1570) ) also found minimal correlations between GRE scores and graduate grades. When GRE scores were correlated with faculty ratings on five variables (analytical, creative, practical, research, and teaching abilities), the correlations were even lower, for the most part hovering right around zero. The single exception was the GRE analytical thinking score, which correlated modestly with almost all of the faculty ratings. However, this correlation was observed only for men (on the order of r = .3), whereas for women it was almost exactly zero in every case! Based on these and similar studies, the consensus would appear to be that excessive reliance on the GRE for graduate school selection may overlook a talented pool of promising graduate students.
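The 6.3 percent figure follows from squaring the correlations reported by Morrison and Morrison. The sketch below assumes the figure is the mean of the two squared correlations; the source does not spell out the calculation, and the variable names are illustrative.

```python
def variance_explained(r):
    """Proportion of criterion variance accounted for by a correlation r."""
    return r ** 2


correlations = {"GRE-V": 0.28, "GRE-Q": 0.22}
shares = {name: variance_explained(r) for name, r in correlations.items()}
mean_share = sum(shares.values()) / len(shares)

for name, share in shares.items():
    print(f"{name}: {share:.1%} of variance in graduate GPA")
print(f"mean: {mean_share:.1%}")  # 6.3%
```

Squaring the average correlation of .25 gives nearly the same value, so either reading of "on average" leads to about 6.3 percent.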
However, other researchers are more supportive in their evaluation of the GRE, noting that the correlation of GRE scores and graduate grades is not a good index of validity because of the restriction of range problem (Kuncel, Campbell, & Ones, 1998 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib930) ). Specifically, applicants with low GRE scores are unlikely to be accepted for graduate training in the first place and, thus, relatively little information is available with respect to whether low scores predict poor academic performance. Put simply, the correlation of GRE scores with graduate academic performance is based
mainly on persons with middle to high levels of GRE scores, that is, GRE-V + GRE-Q totals of 1,000 and up. As such, the correlation will be attenuated precisely because those with low GREs are not included in the sample. Another problem with validating the GRE against grades in graduate school is the unreliability of the criterion (grades). Based on the expectation that graduate students will perform at high levels, some professors may give blanket A’s such that grades do not reflect real differences in student aptitudes. This would lower the correlation between the predictor (GRE scores) and the criterion (graduate grades). When these factors are accounted for, many researchers find reason to believe the GRE is still a valid tool for graduate school selection (Powers, 2004 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1319) ).
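Both arguments can be made concrete with two standard psychometric corrections. The sketch below assumes Thorndike's Case 2 formula for direct range restriction on the predictor and the classical correction for unreliability in the criterion. The input values (an observed correlation of .25, an applicant-pool standard deviation 1.5 times that of admitted students, and a grade reliability of .70) are hypothetical and are not taken from the studies cited.

```python
import math


def correct_range_restriction(r_restricted, sd_ratio):
    """Thorndike Case 2 correction for direct restriction on the predictor.

    sd_ratio is the unrestricted predictor SD divided by the restricted SD.
    """
    r, u = r_restricted, sd_ratio
    return (r * u) / math.sqrt(1.0 - r * r + r * r * u * u)


def correct_criterion_unreliability(r_observed, criterion_reliability):
    """Disattenuate a validity coefficient for an unreliable criterion."""
    return r_observed / math.sqrt(criterion_reliability)


observed = 0.25  # hypothetical GRE-grade correlation among admitted students
after_range = correct_range_restriction(observed, sd_ratio=1.5)
after_both = correct_criterion_unreliability(after_range,
                                             criterion_reliability=0.70)
print(round(after_range, 2), round(after_both, 2))
```

Each correction raises the estimated validity, which is the direction of the argument made above; the size of the increase depends entirely on the assumed inputs.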
In a comprehensive meta-analysis of 1,753 independent groups of students, Kuncel, Hezlett, and Ones (2001 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib931) ) confirmed the validity of the GRE tests (Verbal, Quantitative, and Analytical) for the prediction of graduate student performance. The total sample size for their analysis was huge, including 82,659 students. The breadth of their investigation allowed them to code studies for several different forms of student accomplishment. GRE general test scores were significantly associated with the following student outcomes: first-year GPA, overall GPA, comprehensive exam scores, faculty ratings, and publication citation counts. The researchers also discovered that the GRE Psychology subject test outperformed the general test as a predictive measure of student success.
Medical College Admission Test (MCAT)
The MCAT is required of applicants to almost all medical schools in the United States. The test is designed to assess achievement of the basic skills and concepts that are prerequisites for successful completion of medical school. There are three multiple-choice sections (Verbal Reasoning, Physical Sciences, Biological Sciences). The Verbal Reasoning section (40 questions) is designed to evaluate the ability to understand and apply information and arguments presented in written form. Specifically, the test consists of several passages of about 500 to 600 words each, taken from humanities, social sciences, and natural sciences. Each passage is followed by several questions based on information included in the passage. The Physical Sciences section (52 questions) is designed to evaluate reasoning in general chemistry and physics. The Biological Sciences section (52 questions) is designed to evaluate reasoning in biology and organic chemistry. These physical and biological science sections contain 10 to 11 problem sets described in about 250 words each, with several questions following.
Following the three required parts of the MCAT, an optional trial section of 32 questions is administered. This portion is not scored. The purpose of the trial section is to pretest questions for future exams. Some trial questions are designed for a new section of the MCAT, Psychological, Social, and Biological Foundations of Behavior, scheduled to commence in 2015. This new section will test knowledge of important concepts in introductory psychology, sociology, and biology, related to mental processes and behavior. The addition of this section acknowledges that effective doctors need to understand the whole person, including social and cultural determinants of health and health-related behaviors.
Each of the MCAT scores is reported on a scale from 1 to 15 (means of about 8.0 and standard deviations of about 2.5). The reliability of the test is lower than that of other aptitude tests used for selection, with internal consistency and split-half coefficients mainly in the low .80s (Gregory, 1994a (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib646) ). MCAT scores are mildly predictive of success in medical school, but once again the restriction of range conundrum (previously discussed in relation to the GRE) is at play. In particular, examinees with low MCAT scores who would presumably confirm the validity of the test by performing poorly in medical school are rarely admitted, which reduces the apparent validity of the test.
Julian (2005 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib849) ) confirmed the validity of the MCAT for predicting medical school performance by following 4,076 students who entered 14 medical schools in 1992 and 1993. Outcome variables included GPA and national medical licensing exam scores. When corrected for restriction of range, the predictive validity coefficients for MCAT scores were impressive, on the order of .6 for medical school grades, and as high as .7 for licensing exam scores. In fact, the MCAT scores were so strongly predictive of licensing exam scores that adding undergraduate GPAs into the equation did not appreciably boost the correlation. Julian (2005 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib849) ) concludes that MCAT scores essentially replace the need for undergraduate GPAs in medical school student selection because of their remarkable capacity to predict medical licensing exam scores.
Law School Admission Test (LSAT)
The LSAT is more than 60 years old. The test arose in the 1940s as a group effort from deans of leading law schools, who used first-year grades in the early validation of the instrument (LaPiana, 1998 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib950) ). Practicality was a major impetus for test development, as law schools were flooded with worthy applicants. Also, there was an idealistic desire to ensure that admission to law school was based on aptitude and potential, not on privilege or connection. A leading figure in LSAT development has noted:
What makes us Americans is our adherence to the system that governs our nation. If that’s true, then being a lawyer is one of the most important jobs in American society because it is the lawyer’s job to make sure the law works and serves people. And if that is true, then the American legal profession is much too important to be left in the hands of a self-perpetuating elite. It has to be open to all Americans with the talent and ability to do legal work, no matter how their last names are spelled or where they or their ancestors were born or the color of their skin (LaPiana, 1998 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib950) , p. 12).
About 150,000 individuals take the LSAT each year. Of course, many other variables come into play in law school admissions, but test results probably are the single most important factor.
The LSAT is a half-day standardized test required of applicants to virtually every law school in the United States. The test is designed to measure skills considered essential for success in law school, including the reading and understanding of complex material, the organization and management of information, and the ability to reason critically and draw correct inferences. The LSAT consists of multiple-choice questions in four areas: reading comprehension, analytical reasoning, and two logical reasoning sections. An additional section is used to pretest new test items and to preequate new test forms, but this section does not contribute to the LSAT score. The score scale for the LSAT extends from a low of 120 to a high of 180. In addition to the objective portions, a 35-minute writing sample is administered at the end of the test. The section is not scored, but copies of the writing sample are sent to all law schools to which the examinee applies.
The LSAT has acceptable reliability (internal consistency coefficients in the .90s) and is regarded as a moderately valid predictor of law school grades. Yet, in one fascinating study, LSAT scores correlated more strongly with state bar test results than with law school grades (Melton, 1985). This speaks well for the validity of the test, insofar as it links LSAT scores with an important, real-world criterion.
In recent years, those responsible for law school admissions have shown interest in selection methods that go beyond the LSAT. One example is a promising project from the University of California, Berkeley, which ambitiously seeks to assess 26 traits identified as crucial to effective performance of lawyers
(Chamberlin, 2009 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib293) ). Using focus groups and individual interviews, psychologist Sheldon Zedeck and lawyer Marjorie Shultz distilled these 26 traits, which include varied capacities like practical judgment, researching the law, writing, integrity/honesty, negotiation skills, developing relationships, stress management, fact finding, diligence, listening, and community involvement/service. Next they developed realistic scenarios designed to evaluate one or more of these qualities. A sample question might ask the applicant to take the role of a team leader in a law firm. A verbal fight breaks out between two of the team members over the best way to proceed with the project. What should the team leader do? A number of options are listed, and the applicant is asked to rank them from best to worst. The format of the questions is varied. For other questions, the applicant might be asked to provide a short written response. Initial research with this yet-unnamed instrument indicates that it predicts success in the practice of law substantially better than the LSAT.
6.6 EDUCATIONAL ACHIEVEMENT TESTS
Achievement tests permit a wide range of potential uses. Practical applications of group achievement tests include the following:
To identify children and adults with specific achievement deficits who might need more detailed assessment for learning disabilities
To help parents recognize the academic strengths and weaknesses of their children and thereby foster individual remedial efforts at home
To identify classwide or schoolwide achievement deficiencies as a basis for redirection of instructional efforts
To appraise the success of educational programs by measuring the subsequent skill attainment of students
To group students according to similar skill level in specific academic domains
To identify the level of instruction that is appropriate for individual students
Thus, achievement tests serve institutional goals such as monitoring schoolwide achievement levels, but also play an important role in the assessment of individual learning difficulties. As previously noted, different kinds of achievement tests are used to pursue these two fundamental applications (institutional and individual). Institutional goals are best served by group achievement test batteries, whereas individual assessment is commonly pursued with individual achievement tests (even though group tests may play a role here, too). Here we focus on group educational achievement tests.
Virtually every school system in the nation uses at least one educational achievement test, so it is not surprising that test publishers have responded to the widespread need by developing a panoply of excellent instruments.
In the following section, we describe several of the most widely used group standardized achievement tests. We limit our coverage here to three educational achievement tests, each distinctive in its own way. The Iowa Tests of Basic Skills (ITBS) is representative of the huge industry of standardized achievement testing used in virtually all school systems nationwide. The Metropolitan Achievement Test is of the same genre as the ITBS but embodies a new and powerful technique of reading assessment known as the Lexile approach and, thus, merits special attention. Finally, almost everyone has heard of the Tests of General Educational Development, known familiarly as the “GED.” We would be remiss not to discuss this testing program.
Iowa Tests of Basic Skills (ITBS)
First published in 1935, the Iowa Tests of Basic Skills (ITBS) were most recently revised and restandardized in 2001. The ITBS is a multilevel battery of achievement tests that covers grades K through 8. A companion test, the Tests of Achievement and Proficiency (TAP), covers grades 9 through 12. In order to expedite direct and accurate comparisons of achievement and ability, the ITBS and the TAP were both concurrently normed with the Cognitive Abilities Test (CogAT), a respected group test of general intellectual ability.
The ITBS is available in several levels that correspond roughly with the ages of the potential examinees: levels 5–6 (grades K–1), levels 7–8 (grades 2–3), and levels 9–14 (grades 3–8). The basic subtests for the older levels measure vocabulary, reading, language, mathematics, social studies, science, and sources of information (e.g., uses of maps and diagrams). A brief description of the subtests for grades 3–8 is
8/4/2019 Print
https://content.ashford.edu/print/Gregory.8055.17.1?sections=ch06,ch06lev1sec1,ch06lev1sec2,ch06lev1sec3,ch06lev1sec4,ch06lev1sec5,ch06lev… 35/79
provided in Table 6.3 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/ch06lev1sec6#ch06tab3) .
From the first edition onward, the ITBS has been guided by a pragmatic philosophy of educational measurement. The manual states the purpose of testing as follows:
The purpose of measurement is to provide information which can be used in improving instruction. Measurement has value to the extent that it results in better decisions which directly affect pupils.
To this end, the ITBS incorporates a criterion-referenced skills analysis to supplement the usual array of norm-referenced scores. For example, one feature available from the publisher’s scoring service is item-level information. This information indicates the topic areas, the items sampling each topic, and whether each item was answered correctly or incorrectly. Teachers, therefore, have access to a wealth of diagnostic-instructional information for each student. Whether this information translates to better instruction, as the test authors desire, is very difficult to quantify. As Linn (1989 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib996) ) notes, “We must rely mostly on logic, anecdotes, and opinions when it comes to answering such questions.”
The technical properties of the ITBS are beyond reproach. Historically, internal consistency and equivalent-form reliability coefficients are mostly in the mid-.80s to low .90s. Stability coefficients for a one-year interval are almost all in the .70 to .90 range. The test is free from overt racial and gender bias, as determined by content evaluation and item bias studies. The year 2000 norms for the test were empirically developed from large, representative national probability samples.
TABLE 6.3 Brief Description of ITBS Subtests for Grades 3–8
Vocabulary: A word is presented in the context of a short phrase or sentence, and students select the correct meaning from multiple-choice alternatives.
Reading Comprehension: Students read a brief passage and answer multiple-choice questions that require inference or generalization.
Spelling: Each multiple-choice item presents four words, one of which may be misspelled, and a fifth option, “no mistakes.”
Capitalization: Test items require students to identify errors of under- or overcapitalization present in brief written passages.
Punctuation: Multiple-choice items require students to identify errors of punctuation involving commas, apostrophes, quotation marks, colons, and so on, or choose no mistakes.
Usage and Expression: In the first part, students identify errors in usage or expression; in the second part, students choose the best way to express an idea.
Math Concepts and Estimation: Questions deal with computation, algebra, geometry, measurement, and probability and statistics.
Math Problem Solving and Data Interpretation: Questions may involve multistep word problems or interpretation of tables and graphs.
Math Computation: These test items require the use of one arithmetic operation (addition, subtraction, multiplication, or division) with whole numbers, fractions, and decimals.
Social Studies: These questions involve aspects of history, geography, economics, and so on that are ordinarily covered in most school systems.
Science: These test items involve aspects of biology, ecology, space science, and physical sciences ordinarily covered in most school systems.
Maps and Diagrams: These questions evaluate the ability to use maps for a variety of purposes such as determining locations, directions, and distances.
Reference Materials: These questions measure the ability to use reference materials and library resources.
Item content of the ITBS is judged relevant by curriculum experts and reviewers, which speaks to the content validity of the test (Lane, 1992 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib949) ; Linn, 1989 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib996) ). Although the predictive validity of the latest ITBS has not been studied extensively, evidence from prior editions is very encouraging. For example, ITBS scores correlate moderately with high school grades (r’s around .60). The ITBS is not a perfect instrument, but it represents the best that modern test development methods can produce.
Metropolitan Achievement Test (MAT)
The Metropolitan Achievement Test dates back to 1930 when the test was designed to meet the curriculum assessment needs of New York City. The stated purpose of the MAT is “to measure the achievement of students in the major skill and content areas of the school curriculum.” The MAT is concurrently normed with the Otis-Lennon School Ability Test (OLSAT).
Now in its eighth edition, the MAT is a multilevel battery designed for grades K through 12 and was most recently normed in 2000. The areas tested by the MAT include the traditional school-related skills:
Reading
Mathematics
Language
Writing
Science
Social Studies
An attractive feature of the MAT is that student reading scores are reported as Lexile measures, a new and practical indicator of reading level. Lexile measures are likely to become a standard feature in most group achievement tests in the years ahead, so it is worth a brief detour to explain their nature and significance.
Lexile Measures
The Lexile approach is a major improvement in the assessment of reading skill. It was developed over a span of more than 12 years using millions of dollars in grant funds from the National Institute of Child Health and Human Development (NICHD) (www.lexile.com (http://www.lexile.com) ). The Lexile approach is based on two simple, commonsense assumptions, namely (1) reading materials can be placed on a continuum as to difficulty level (comprehensibility) and (2) readers can be ordered on a continuum as to reading ability. The Lexile framework provides a common metric for matching readers and text, which, in turn, permits parents and educators to choose appropriate reading materials for children.
The Lexile scale (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm01#bm01gloss185) is a true interval scale. The Lexile measure for a reading selection is a specific number indicating the reading demand of the text based on the semantic difficulty (vocabulary) and syntactic complexity (sentence length). Lexile measures for reading selections typically range from 200L to 1,700L (Lexiles). The Lexile score for a student, obtained from the Reading Comprehension test of the MAT or other achievement tests, is a precise index of the student’s reading ability, calibrated on the same scale as the Lexile measure for text. The value of the Lexile approach is that student comprehension can be predicted as a function of the disparity between the demands of the text and the student’s ability. For example, when readers are well targeted (the difference between text and reader is close to 0 Lexiles), research indicates that reader comprehension will be about 75 percent. When the text difficulty exceeds the reader’s ability by 250L, comprehension drops to about 50 percent. When the skill of the reader exceeds the demands of the text by 250L, comprehension is about 90 percent (www.lexile.com (http://www.lexile.com) ).
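The link between the reader-text gap and expected comprehension can be sketched numerically. The following is a minimal illustration, not the publisher's scoring procedure: it assumes a logistic curve whose two constants (an offset of 1.1 and a scale of 225L) are chosen here only because they reproduce the three figures quoted above, and the function name `forecast_comprehension` is hypothetical.

```python
import math

# Assumed constants, chosen so the curve passes close to the three
# reference points quoted in the text (about 75%, 50%, and 90%).
OFFSET = 1.1     # logit of comprehension when reader and text match
SCALE = 225.0    # Lexiles per logit (an assumption for this sketch)


def forecast_comprehension(reader_lexile: float, text_lexile: float) -> float:
    """Return a forecast comprehension rate between 0 and 1."""
    gap = (reader_lexile - text_lexile) / SCALE
    return 1.0 / (1.0 + math.exp(-(gap + OFFSET)))


if __name__ == "__main__":
    # Three hypothetical readers facing the same 910L text.
    for reader in (660, 910, 1160):
        rate = forecast_comprehension(reader, 910)
        print(f"{reader}L reader, 910L text: {rate:.0%}")
```

Under these assumed constants, a reader 250L below the text is forecast at roughly 50 percent, a well-targeted reader at roughly 75 percent, and a reader 250L above the text at roughly 90 percent.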
The Lexile approach has a number of potential benefits and applications for teachers and parents. Teachers can look up Lexile measures for specific books (the Lexile corporation has evaluated over 30,000 titles to date) as a way of building a library of titles at varying levels. Also, they can produce individualized reading lists suitable for each student. Likewise, parents can select well-matched books to read to their children. Stenner (2001 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1553) ) captures the allure of the Lexile approach as follows:
One of the great strengths of the Lexile Framework is the way it encourages thought about what forecasted comprehension rate would be optimal for different instructional contexts. Harry Potter and the Goblet of Fire is a 910L text. Readers at 400L to 500L can nonetheless enjoy listening to this story read aloud. A 700L reader could read the text in a one-on-one tutoring context. A 900L reader will disappear for an hour or two, fully capable of self-engaging with the text, and a 1600L adult reader can become so engrossed that a two-hour plane ride flies by.
The Lexile approach is not a panacea, but it is a major improvement in the assessment of reading skill.
Tests of General Educational Development (GED)
Another widely used achievement test battery is the Tests of General Educational Development (GED), developed by the American Council on Education and administered nationwide for high school equivalency certification (www.acenet.edu (http://www.acenet.edu) ). The GED consists of multiple-choice examinations in five educational areas:
Language Arts—Writing
Language Arts—Reading
Mathematics
Science
Social Studies
The Language Arts—Writing section also contains an essay question that examinees must answer in writing. The essay question is scored independently by two trained readers according to a 6-point holistic scoring method. The readers make a judgment about the essay based on its overall effectiveness in comparison to the effectiveness of other essays.
The GED comes in numerous alternate forms. Typically, internal consistency reliabilities for the subscales are above .90. However, the interrater reliability of scoring on the writing samples is more modest, typically between .6 and .7. These findings indicate that a liberal criterion for passing this subtest is appropriate so as to reduce decision errors. Regarding validity, the GED correlates very strongly (r = .77) with the graduation reading test used in New York (Whitney, Malizio, & Patience, 1985 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1753) ). Furthermore, the standards for passing the GED are more stringent than those employed by most high schools: Currently, individuals who receive a passing score for a GED credential outperform at least 40 percent of graduating high school seniors (www.acenet.edu (http://www.acenet.edu) ).
The GED emphasizes broad concepts rather than specific facts and details. In general, the purpose of the GED is to allow adults who did not graduate from high school to prove that they have obtained an equivalent level of knowledge from life experiences or independent study. Employers regard the GED as equivalent (if not superior) to earning a high school diploma. Successful performance on the GED enables individuals to apply to colleges, seek jobs, and request promotions that require a high school diploma as a prerequisite. Rogers (1992 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1382) ) provides an unusually thorough review of the GED.
Additional Group Standardized Achievement Tests
In addition to the previously described batteries, a few other widely used group standardized achievement tests deserve brief mention. These instruments are listed in Table 6.4 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/ch06lev1sec6#ch06tab4) .
TABLE 6.4 Selected Group Achievement Tests for Elementary and Secondary School Assessment
Iowa Tests of Educational Development (ITED): Designed for grades 9 through 12, the objective of this test battery is to measure the fundamental goals or generalized skills of education that are independent of the curriculum. Most of the test items require the synthesis of knowledge or a multiple-step solution.
Tests of Achievement and Proficiency (TAP): This instrument is designed to provide a comprehensive appraisal of student progress toward traditional academic goals in grades 9 through 12. This test is co-normed with the ITED and the CogAT.
Stanford Achievement Test (SAchT): Along with the ITBS, the SAchT is one of the leading contemporary achievement tests. Dating back more than 80 years and now in its tenth edition, it is administered to more than 15 million students every year.
TerraNova CTBS: For grades 1 through 12, this multi-level test combines multiple-choice questions with constructed response items that require students to produce correct answers, not just select them from alternatives.
TOPIC 6B Test Bias and Other Controversies
6.7 The Question of Test Bias (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/ch06lev1sec7#ch06lev1sec7)
Case Exhibit 6.1 The Impact of Culture on Testing Bias (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/ch06lev1sec7#ch06exh1)
6.8 Social Values and Test Fairness (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/ch06lev1sec8#ch06lev1sec8)
6.9 Genetic and Environmental Determinants of Intelligence (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/ch06lev1sec9#ch06lev1sec9)
6.10 Origins and Trends in Racial IQ Differences (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/ch06lev1sec10#ch06lev1sec10)
6.11 Age Changes in Intelligence (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/ch06lev1sec11#ch06lev1sec11)
6.12 Generational Changes in IQ Scores (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/ch06lev1sec12#ch06lev1sec12)
An intelligence test is a neutral, inconsequential tool until someone assigns significance to the results derived from it. Once meaning is attached to a person’s test score, that individual will experience many repercussions, ranging from superficial to life-changing. These repercussions will be fair or prejudiced, helpful or harmful, appropriate or misguided—depending on the meaning attached to the test score.
Unfortunately, the tendency to imbue intelligence test scores with inaccurate and unwarranted connotations is rampant. Laypersons and students of psychology commonly stray into one thicket of harmful misconceptions after another. Test results are variously overinterpreted or underinterpreted, viewed by some as a divination of personal worth but devalued by others as trivial and unfair.
The purpose of this topic is to clarify further the meaning of intelligence test scores in the light of relevant behavioral research. We begin by dispelling a number of everyday misconceptions about IQ and then pursue several empirically based issues—some would say controversies—that bear on the meaning of intelligence test scores:
The question of test bias
Genetic and environmental effects on intelligence
Origins of IQ differences between African Americans and Caucasian Americans
The fate of intelligence in middle and old age
Generational changes in intelligence test scores
The underlying theme of this section is that intelligence test scores are best understood within the framework of modern psychological research. The reader is warned that the research issues pursued here are complex, confusing, and occasionally contradictory. However, the rewards for grappling with these topics are substantial. After all, the meaning of intelligence tests is demarcated, sharpened, and refined entirely by empirical research.
6.7 THE QUESTION OF TEST BIAS
Beyond a doubt, no practice in modern psychology has been more assailed than psychological testing. Commentators reserve a special and often vehement condemnation for ability testing in particular. In his wide-ranging response to the hundreds of criticisms aimed at mental testing, Jensen (1980 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib832) ) concluded that test bias is the most common rallying point for the critics. In proclaiming test bias (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm01#bm01gloss328) , the skeptics assert in various ways that tests are culturally and sexually biased so as to discriminate unfairly against racial and ethnic minorities, women, and the poor. We cite here a sampling of verbatim criticisms (Jensen, 1980 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib832) ):
Intelligence tests are sadly misnamed because they were never intended to measure intelligence and might have been more aptly called CB (cultural background) tests.
Persons from backgrounds other than the culture in which the test was developed will always be penalized.
There are enormous social class differences in a child’s access to the experiences necessary to acquire the valid intellectual skills.
IQ scores reported for African Americans and low socioeconomic groups in the United States reflect characteristics of the test rather than of the test takers.
The poor performance of African American children on conventional tests is due to the biased content of the tests; that is, the test material is drawn from outside the African American culture.
Women are not so good as men at mathematics only because women have not taken as much math in high school and college.
Are these criticisms valid? The investigation of this question turns out to be considerably more complicated than the reader might suppose. A most important point is that appearances can be deceiving. As we will explain subsequently, the fact that test items “look” or “feel” preferential to one race, sex, or social class does not constitute proof of test bias. Test bias is an objective, empirical question, not a matter of personal judgment.
Although critics may be loath to admit it, dispassionate and objective methods for investigating test bias do exist. One purpose of this section is to present these methods to the reader. However, an aseptic discussion of regression equations and statistical definitions of test bias would be incomplete, only half of the story. Conceptions of test bias are irretrievably intermingled with notions of test fairness. A full explanation of the story surrounding the test-bias controversy requires that we investigate the related issue of test fairness, too.
Differences in terminology abound in this area, so it is important to set forth certain fundamental distinctions before proceeding. Test bias is a technical concept amenable to impartial analysis. The most salient methods for the objective assessment of test bias are discussed in the sections that follow. In contrast, test fairness reflects social values and philosophies of test use, particularly when test use extends to selection for privilege or employment. Much of the passion that surrounds the test-bias controversy stems from a failure to distinguish test bias from test fairness. To avoid confusion, it is crucial to draw a sharp distinction between these two concepts. We include separate discussions of test bias and test fairness, beginning with an analysis of why test bias is such a controversial topic.
The Test-Bias Controversy
The test-bias controversy has its origins in the observed differences in average IQ among various racial and ethnic groups. For example, African Americans score, on average, about 15 points lower than White Americans on standardized IQ tests. This difference reduces to 7 to 12 IQ points when socioeconomic disparities are taken into account. The existence of marked racial/ethnic differences in ability test scores has fanned the fires of controversy over test bias. After all, employment opportunities, admission to college, completion of a high school diploma, and assignment to special education classes are all governed, in part, by test results. Biased tests could perpetuate a legacy of racial discrimination. Test bias is deservedly a topic of intense scrutiny by both the public and the testing professions.
One possibility is that the observed IQ disparities indicate test bias rather than meaningful group differences. In fact, most laypersons and even some psychologists would regard the magnitude of race differences in IQ as prima facie evidence that intelligence tests are culturally biased. This is an appealing argument, but a large difference between defined subpopulations is not a sufficient basis for proving test bias. The proof of test bias must rest on other criteria outlined in the following section.
When do test score differences between groups signify test bias? We begin by reviewing the criteria that should be used to investigate test bias of any kind, whether for race, gender, or any other defining characteristic.
Criteria of Test Bias and Test Fairness
The topic of test bias has received wide attention from measurement psychologists, test developers, journalists, test critics, legislators, and the courts. Cole and Moss (1998) underscore an unsettling consequence of the proliferation of views held on this topic, namely, that concepts of test bias have become increasingly intricate and complex. Furthermore, the understanding of test bias is made difficult by the implicit and often emotional assumptions (held even by scholars) that may lead honest persons to view the same information in different ways.
In part, disagreements about test bias are perpetuated because adversaries in this debate fail to clarify essential terminology. Too often, terms such as test bias and test fairness are considered interchangeable and thrown about loosely without definition. We propose that test bias and test fairness commonly refer to markedly different aspects of the test-bias debate. Careful examination of both concepts will provide a basis for a more reasoned discussion of this controversial topic.
As interpreted by most authorities in this field, test bias refers to objective statistical indices that examine the patterning of test scores for relevant subpopulations. Although experts might disagree about nuances, on the whole there is a consensus about the statistical criteria that indicate when a test is biased. We will expand this point later, but we can provide the reader with a brief preview here: In general, a test is deemed biased if it is differentially valid for different subgroups. For example, a test would be considered biased if the scores from appropriate subpopulations did not fall on the same regression line for a relevant criterion.
In contrast to the narrow concept of test bias, test fairness is a broad concept that recognizes the importance of social values in test usage. Even a test that is unbiased according to the traditional technical criteria of homogeneous regression might still be deemed unfair because of the social consequences of using it for selection decisions. The crux of the debate is this: Test bias (a statistical concept) is not necessarily the same thing as test fairness (a values concept). Ultimately, test fairness is based on social conceptions such as one’s image of a just society. In the assessment of test fairness, subjective values are of overarching importance; the statistical criteria of test bias are merely ancillary. We will return to this point later when we analyze the link between social values and test fairness. But let us begin with a traditional presentation of technical criteria for test bias.
The Technical Meaning of Test Bias: A Definition
One useful way to examine test bias is from the technical perspective of test validation. The reader will recall from an earlier chapter that a test is valid when a variety of evidence supports its utility and when inferences derived from it are appropriate, meaningful, and useful. One implication of this viewpoint is that test bias can be equated with differential validity for different groups:
Bias is present when a test score has meanings or implications for a relevant, definable subgroup of test takers that are different from the meanings or implications for the remainder of the test takers. Thus, bias is differential validity of a given interpretation of a test score for any definable, relevant subgroup of test takers. (Cole & Moss, 1998)
Perhaps a concrete example will help clarify this definition. Suppose a simple word problem arithmetic test were used to measure youngsters’ addition skills. The problems might be of the form “If you have two six-packs of pop, how many cans do you have altogether?” Suppose, however, the test is used in a group of primarily Spanish-speaking seventh graders. With these children, low scores might indicate a language barrier, not a problem with arithmetic skills. In contrast, for English-speaking children low scores would most likely indicate a deficit in arithmetic skills. In this example, the test has differential validity, predicting arithmetic deficits quite well for English-speaking children but very poorly for Spanish-speaking children. According to the technical perspective of test validation, we would conclude that the test is biased.
Although the general definition of test bias refers to differential validity, in practice the particular criteria of test bias fall under three main headings: content validity, criterion-related validity, and construct validity. We will review each of these categories, discussing relevant findings along the way. The coverage is illustrative, not exhaustive. Interested readers should consult Jensen (1980 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib832) ), Cole and Moss (1998), and Reynolds and Brown (1984b (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1361) ).
Bias in Content Validity
Bias in content validity is probably the most common criticism raised by those who denounce the use of standardized tests with minorities (Helms, 1992 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib732) ; Hilliard, 1984 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib748) ; Kwate, 2001 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib935) ). Typically, critics rely on their own expert judgment when they expound one or more of the following criticisms of the content validity of ability tests:
The items ask for information that ethnic minority or disadvantaged persons have not had equal opportunity to learn.
The scoring of the items is improper, since the test author has arbitrarily decided on the only correct answer and ethnic minorities are inappropriately penalized for giving answers that would be correct in their own culture but not that of the test maker.
The wording of the questions is unfamiliar, and an ethnic minority person who may “know” the correct answer may not be able to respond because he or she does not understand the question (Reynolds, 1998 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1359) ).
Any of these criticisms, if accurate, would constitute bona fide evidence of test bias. However, merely stating a criticism does not constitute proof. Where these criticisms fall short is that they are seldom buttressed by empirical evidence.
Reynolds (1998 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1359) ) has offered a definition of content bias (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm01#bm01gloss67) for aptitude tests that addresses the preceding points in empirically defined, testable terms:
An item or subscale of a test is considered to be biased in content when it is demonstrated to be relatively more difficult for members of one group than another when the general ability level of the groups being compared is held constant and no reasonable theoretical rationale exists to explain group differences on the item (or subscale) in question.
This definition is useful because it proposes an empirical approach to the question of test bias.
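To make the empirical flavor of this definition concrete, the sketch below checks whether a single item is harder for one group than another once examinees are matched on general ability. It uses the Mantel-Haenszel common odds ratio, a standard technique for this kind of matched comparison that is substituted here for illustration and is not attributed to Reynolds. The group labels, ability bands, and response counts are all hypothetical, and the final clause of the definition (whether a reasonable theoretical rationale explains the difference) remains a matter of judgment that no statistic captures.

```python
from collections import defaultdict


def mantel_haenszel_odds_ratio(records):
    """Common odds ratio for one item across ability bands.

    records: iterable of (group, ability_band, correct), where group is
    "ref" or "focal" and correct is True or False. A value near 1.0 means
    the item is about equally difficult for both groups at matched ability.
    """
    # Per band: [ref correct, ref wrong, focal correct, focal wrong]
    counts = defaultdict(lambda: [0, 0, 0, 0])
    for group, band, correct in records:
        index = (0 if group == "ref" else 2) + (0 if correct else 1)
        counts[band][index] += 1

    numerator = denominator = 0.0
    for a, b, c, d in counts.values():
        n = a + b + c + d
        numerator += a * d / n
        denominator += b * c / n
    if denominator == 0:
        raise ValueError("odds ratio is undefined for these counts")
    return numerator / denominator


def expand(group, band, n_correct, n_wrong):
    """Build hypothetical response records from summary counts."""
    return ([(group, band, True)] * n_correct
            + [(group, band, False)] * n_wrong)


# Hypothetical data: two ability bands, 100 examinees per group per band.
flagged_item = (expand("ref", "low", 50, 50) + expand("focal", "low", 30, 70)
                + expand("ref", "high", 90, 10) + expand("focal", "high", 70, 30))
matched_item = (expand("ref", "low", 40, 60) + expand("focal", "low", 40, 60)
                + expand("ref", "high", 80, 20) + expand("focal", "high", 80, 20))

if __name__ == "__main__":
    print(round(mantel_haenszel_odds_ratio(flagged_item), 2))
    print(round(mantel_haenszel_odds_ratio(matched_item), 2))
```

In this toy data the first item yields an odds ratio well above 1, meaning it is relatively harder for the focal group at every ability level, whereas the second item yields a ratio of 1 and would not be flagged.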
In general, attempts to prove that expert-nominated items are culturally biased have not yielded the conclusive evidence that critics expect. McGurk (1953a (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1098) , 1953b (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1099) , 1975 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1100) ) has written extensively on this topic, and we will use his classic study to illustrate this point. For his doctoral dissertation, McGurk asked a panel of 78 judges (professors, educators, and graduate students in psychology and sociology) to classify each of 226 items from well-known standardized tests of intelligence into one of three categories: least cultural, neutral, most cultural. McGurk administered these test items to hundreds of high school students. His primary analysis involved the test results for 213 African American students and 213 White students matched for curriculum, school, length of enrollment, and socioeconomic background.
McGurk (1953a (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1098) , 1953b (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1099) ) discovered that the mean difference between African American and White students for the total hybrid test, expressed in standard deviation units, was .50. More pertinent to the topic of test bias in content validity was his comparison of scores on the 37 “most cultural” items versus the 37 “least cultural” items. For the “most cultural” items—the ones nominated by the judges as highly culturally biased—the difference was .30. For the “least cultural” items—the ones judged to be more fair to African Americans and other cultural minorities—the difference was .58. In other words, the items nominated as most cultural were relatively easier for African Americans; the items nominated as least cultural were relatively harder. This finding held true even after item difficulty was partialed out. Furthermore, the item difficulties for the two groups were almost perfectly correlated (r = .98 for “most cultural” and r = .96 for “least cultural” items). There is an important lesson here that test critics often overlook: “Expert” judges cannot identify culturally biased test items based on an analysis of item characteristics. Recent studies continue to reaffirm this conclusion (Reynolds, Lowe, & Saenz, 1999 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1363) ).
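The two statistics at the heart of this comparison, a group difference expressed in standard deviation units and a correlation between item difficulties, are simple to compute. The sketch below is illustrative only: the scores and item difficulties are invented, the use of a pooled standard deviation is an assumption (the text does not say how McGurk standardized the difference), and the function names are hypothetical.

```python
import math
from statistics import mean, stdev


def standardized_difference(scores_a, scores_b):
    """Mean difference (a minus b) in pooled standard deviation units."""
    n_a, n_b = len(scores_a), len(scores_b)
    pooled_var = (((n_a - 1) * stdev(scores_a) ** 2
                   + (n_b - 1) * stdev(scores_b) ** 2)
                  / (n_a + n_b - 2))
    return (mean(scores_a) - mean(scores_b)) / math.sqrt(pooled_var)


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical subtest scores for two small groups.
group_a = [12, 14, 16, 18, 20]
group_b = [10, 12, 14, 16, 18]

# Hypothetical item difficulties (proportion passing) in each group.
difficulty_a = [0.90, 0.80, 0.60, 0.40, 0.20]
difficulty_b = [0.85, 0.70, 0.50, 0.35, 0.10]

if __name__ == "__main__":
    print(round(standardized_difference(group_a, group_b), 2))
    print(round(pearson_r(difficulty_a, difficulty_b), 3))
```

A high correlation between the two sets of item difficulties indicates that the items are ordered from easy to hard in nearly the same way for both groups, which is the pattern reported in the study.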
In general, with respect to well-known standardized tests of ability and aptitude, research has not supported the popular belief that the specific content of test items is a source of cultural bias against minorities. This conclusion does not exonerate these tests with respect to other criteria of test bias, discussed in the following sections. Furthermore, we can point out that savvy test developers should be vigilant even to the impression of bias in test content, since the appearance of unfairness can affect public attitudes about psychological tests in quite tangible ways.
Bias in Predictive or Criterion-Related Validity
The prediction of future performance is one important use of intelligence, ability, and aptitude tests. For this application of psychological testing, predictive validity is the most crucial form of validity in relation to test bias. In general, an unbiased test will predict future performance equally well for persons from different subpopulations. For example, an unbiased scholastic aptitude test will predict future academic performance of African Americans and White Americans with near-identical accuracy.
Reynolds (1998 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1359) ) offers a clear, direct definition of test bias with regard to criterion-related or predictive validity bias (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm01#bm01gloss251) :
A test is considered biased with respect to predictive validity if the inference drawn from the test score is not made with the smallest feasible random error or if there is constant error in an inference or prediction as a function of membership in a particular group.
This definition of test bias invokes what might be referred to as the criterion of homogeneous regression. According to this viewpoint, a test is unbiased if the results for all relevant subpopulations cluster equally well around a single regression line. In order to clarify this point, we need to introduce concepts relevant to simple regression. The discussion is modeled after Cleary, Humphreys, Kendrick, and Wesman (1975 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib311) ).
Suppose we are using a scholastic aptitude test to predict first-year grade point average (GPA) in college. In the case of a simple regression analysis, prediction of future performance is made from an equation of the form:
Y = bX + a
where Y is the predicted college GPA, X is the score on the aptitude test, and b and a are constants derived from a statistical analysis of test scores and grades of prior students. We will not concern ourselves with how b and a are derived; the reader can find this information in any elementary statistics textbook.
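For readers who prefer to see the arithmetic, the following is a minimal sketch of how b and a can be obtained by ordinary least squares. The scores and grades are invented for illustration, and the helper name fit_line is our own; nothing here comes from an actual validation study.

```python
# Hypothetical data for prior students: aptitude scores (X) and first-year GPAs (Y).
scores = [400, 450, 500, 550, 600, 650, 700]
gpas = [2.0, 2.3, 2.4, 2.9, 3.0, 3.3, 3.6]


def fit_line(x, y):
    """Return (b, a) for the least-squares line Y = bX + a."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sum of squares around the means.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx            # slope
    a = mean_y - b * mean_x  # intercept on the vertical axis
    return b, a


b, a = fit_line(scores, gpas)

# Predict the GPA of a new applicant with a hypothetical score of 580.
predicted = b * 580 + a
print(f"b = {b:.4f}, a = {a:.3f}, predicted GPA = {predicted:.2f}")
```

Because the intercept is computed from the two means, the fitted line always passes through the point defined by the mean test score and the mean GPA.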
FIGURE 6.7 Test Scores, Grades, and Regression Line for a Hypothetical Large Group of College Students
Note: The dotted line shows how the regression line can be used to predict grade point average from the test score for a single, new subject.
The values of b and a correspond to important aspects of the regression line—the straight line that facilitates the most accurate prediction of the criterion (college grades) from the predictor (aptitude score) (Figure 6.7 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/ch06lev1sec7#ch06fig7) ). In particular, b corresponds to the slope of the line, with higher values of b indicating a steeper slope and more accurate prediction. The value of a depicts the intercept on the vertical axis. The units of measurement for b and a cannot be specified in advance because they depend on the underlying scales used for X and Y. Notice in Figure 6.7 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/ch06lev1sec7#ch06fig7) that the regression line is the reference for predicting grades from observed aptitude score.
According to the criterion of homogeneous regression, in an unbiased test a single regression line can predict performance equally well for all relevant subpopulations, even though the means for the different groups might differ. For example, in Figure 6.8 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/ch06lev1sec7#ch06fig8) group A performs better than group B on both predictor and criterion. Yet, the relationship between aptitude score and grades is the same for both groups. In this hypothetical instance, the graph depicts the absence of bias on the aptitude test with respect to criterion-related validity.
A more complicated situation known as intercept bias is shown in Figure 6.9 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/ch06lev1sec7#ch06fig9) . In this case, scores for the two groups do not cluster tightly around the single best regression line shown as a dotted line in the graph. Separate, parallel regression lines (and, therefore, separate regression equations) would be needed to facilitate accurate prediction. If a single regression line were used (the dotted line), criterion scores for group A would be underpredicted, whereas criterion scores for group B would be overpredicted. Thus, the use of a single regression line would constitute a clear instance of test bias, because the test has differential predictive validity for different subgroups.1 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/ch06lev1sec7#ch06fns9a) This is referred to as intercept bias because the Y-axis intercept is different for the two groups.
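The logic of intercept bias can be illustrated with a small numerical sketch. The two groups below are hypothetical and noise-free: both follow lines with the same slope, but the higher-scoring group has the higher intercept. We assume these particular values only to make the pattern easy to see. A single pooled line then misses both groups in opposite directions, and it is the lower-scoring group whose grades are overpredicted.

```python
def fit_line(x, y):
    """Return (b, a) for the least-squares line Y = bX + a."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    return b, mean_y - b * mean_x


def mean_error(x, y, b, a):
    """Mean of (actual - predicted); a positive value means underprediction."""
    return sum(yi - (b * xi + a) for xi, yi in zip(x, y)) / len(x)


# Hypothetical groups with parallel regression lines (same slope, different intercepts).
higher_x = [500, 550, 600, 650, 700]
lower_x = [400, 450, 500, 550, 600]
higher_y = [0.004 * x + 0.8 for x in higher_x]
lower_y = [0.004 * x + 0.4 for x in lower_x]

# Separate fits recover the parallel lines.
b_higher, a_higher = fit_line(higher_x, higher_y)
b_lower, a_lower = fit_line(lower_x, lower_y)

# A single line fitted to everyone at once.
b_pooled, a_pooled = fit_line(higher_x + lower_x, higher_y + lower_y)

err_higher = mean_error(higher_x, higher_y, b_pooled, a_pooled)
err_lower = mean_error(lower_x, lower_y, b_pooled, a_pooled)
print(f"higher-scoring group: mean error {err_higher:+.3f} (underpredicted)")
print(f"lower-scoring group:  mean error {err_lower:+.3f} (overpredicted)")
```

With equal group sizes the two mean errors are equal in size and opposite in sign, because the errors around a least-squares line always sum to zero.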
But what about using separate regression lines for each subgroup? Would this solve the problem and rescue the test from criterion-related test bias? Opinions differ on this point. Although there is no doubt that separate regression equations would maximize predictive accuracy for the combined sample, whether this practice would produce test fairness is debated. We return to this issue later, when we discuss the relevance of social values to test fairness.
FIGURE 6.8 Test Scores, Grades, and Single Regression Line for Two Hypothetical Large Subpopulations of College Students
The Scholastic Aptitude Test (now known as the Scholastic Assessment Test and discussed in a later chapter) has been analyzed by several researchers with regard to test bias in criterion-related validity (Cleary, Humphreys, Kendrick, & Wesman, 1975 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib311) ; Manning & Jackson, 1984 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1039) ). A consistent finding is that separate, parallel, regression lines are needed for African American and White examinees. For example, in one school the best regression equations for African American, White, and combined students were as follows:
FIGURE 6.9 Test Scores, Grades, and Parallel Regression Lines for Two Hypothetical Large Subpopulations of College Students
where Y is the predicted college grade point, V is the SAT Verbal score, and M is the SAT Mathematics score (Cleary et al., 1975 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib311) , p. 29). The effect of using the White or the combined formula is to overpredict college grades for African American subjects based on SAT results. On the traditional four-point scale (A = 4, B = 3, etc.), the average amount of overprediction from 17 separate studies was .20 or one-fifth of a grade point (Manning & Jackson, 1984 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1039) ). What these results mean is open to debate, but it seems clear, at least, that the SAT and similar entrance examinations do not underpredict college grades for minorities.
The most peculiar regression outcome, known as slope bias, is depicted in Figure 6.10 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/ch06lev1sec7#ch06fig10) . In this case, the regression lines for separate subgroups are not even parallel. Using a single regression line (the dotted line) for prediction might, therefore, result in both under- and overprediction of scores for selected subjects in both groups. Professional opinion would be unanimous in this case: This test possesses a high degree of test bias in criterion-related validity.
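A brief sketch may help show why nonparallel lines produce both kinds of error within the same group. The data are hypothetical and noise-free, chosen only so that the two group lines cross at a test score of 500. The direction of the pooled line's error then depends on where an examinee falls on the predictor.

```python
def fit_line(x, y):
    """Return (b, a) for the least-squares line Y = bX + a."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    return b, mean_y - b * mean_x


# Both hypothetical groups cover the same range of test scores.
x = [400, 450, 500, 550, 600, 650, 700]
steep_y = [0.006 * xi - 0.6 for xi in x]    # group with the steeper slope
shallow_y = [0.002 * xi + 1.4 for xi in x]  # group with the shallower slope

b_pooled, a_pooled = fit_line(x + x, steep_y + shallow_y)


def error_at(score, actual):
    """Actual minus pooled prediction; negative means overprediction."""
    return actual - (b_pooled * score + a_pooled)


# Errors at the bottom and the top of the score range for each group.
steep_low, steep_high = error_at(400, steep_y[0]), error_at(700, steep_y[-1])
shallow_low, shallow_high = error_at(400, shallow_y[0]), error_at(700, shallow_y[-1])

print(f"steeper group:   {steep_low:+.2f} at 400, {steep_high:+.2f} at 700")
print(f"shallower group: {shallow_low:+.2f} at 400, {shallow_high:+.2f} at 700")
```

In this constructed case the pooled line overpredicts low scorers and underpredicts high scorers in the steeper group, and does the reverse in the shallower group.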
Bias in Construct Validity
The reader will recall that the construct validity of a psychological test can be documented by diverse forms of evidence, including appropriate developmental patterns in test scores, theory-consistent intervention changes in test scores, and confirmatory factor analysis. Because construct validity is such a broad concept, the definition of bias in construct validity (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm01#bm01gloss33) requires a general statement amenable to research from a variety of viewpoints with a broad range of methods. Reynolds (1998 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1359) ) offers the following definition:
Bias exists in regard to construct validity when a test is shown to measure different hypothetical traits (psychological constructs) for one group than for another; that is, differing interpretations of a common performance are shown to be appropriate as a function of ethnicity, gender, or another variable of interest, one typically but not necessarily nominal.
FIGURE 6.10 Test Scores, Grades, and Nonparallel Regression Lines for Two Hypothetical Large Subpopulations of College Students
From a practical standpoint, two straightforward criteria for nonbias flow from this definition (Reynolds & Brown, 1984a (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1360) ).
If a test is nonbiased, then comparisons across relevant subpopulations should reveal a high degree of similarity for (1) the factorial structure of the test and (2) the rank order of item difficulties within the test. Let us examine these criteria in more detail.
An essential criterion of nonbias is that the factor structure of test scores should remain invariant across relevant subpopulations. Of course, even within the same subgroup, the factor structure of a test might differ between age groups, so it is important that we restrict our comparison to same-aged persons from relevant subpopulations. For same-aged subjects, a nonbiased test will possess the same factor structure across subgroups. In particular, for a nonbiased test the number of emergent factors and the factor loadings for items or subscales will be highly similar for relevant subpopulations.
In general, when the items or subscales of prominent ability and aptitude tests are factor-analyzed separately in White and minority samples, the same factors emerge in the relevant subpopulations (Reynolds, 1982; Jensen, 1980 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib832) , 1984 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib834) ). Although minor anomalies have been reported in a handful of studies (Scheuneman, 1987 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1463) ; Gutkin & Reynolds, 1981 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib674) ; Johnston & Bolen, 1984 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib845) ), research in this area is more notable for its consistent findings with respect to factorial invariance across subgroups (e.g., Geary & Whitworth, 1988 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib575) ).
A second criterion of nonbias in construct validity is that the rank order of item difficulties within a test should be highly similar for relevant subpopulations. Since age is a major determinant of item difficulty, this standard is usually checked separately for each age group covered by a test. The reader should note what this criterion does not specify. It does not specify that relevant subgroups must obtain equivalent passing rates for test items. What is essential is that the items that are the most difficult (or least difficult) for one subgroup should be the most difficult (or least difficult) for other relevant subpopulations.
The criterion of similar rank order of item difficulties can be tested in a very straightforward and objective manner. If the difficulty level of each item is computed by means of the p value (percentage passing) for each relevant subpopulation, then it is possible to compare the relative item difficulties across same-aged subgroups. In fact, the similarity of the rank order of item difficulties for any two groups can be gauged objectively by means of a correlation coefficient (rxy). The paired p values for the test items constitute the values of x and y used in the computation. The closer the value of r to 1.00, the more similar the rank ordering of item difficulties for the two groups.
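The computation just described can be sketched in a few lines. The p values below are invented for eight items and two same-aged groups; the function name pearson_r is our own. Notice that the second group passes every item at a lower rate, yet the ordering of the items from easy to hard is the same in both groups.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between paired values x and y."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical proportions passing each of eight items, listed in test order.
p_group_1 = [0.95, 0.88, 0.76, 0.61, 0.47, 0.33, 0.20, 0.09]
p_group_2 = [0.90, 0.80, 0.70, 0.52, 0.40, 0.25, 0.15, 0.05]

r = pearson_r(p_group_1, p_group_2)
print(f"correlation of item difficulties across groups: r = {r:.3f}")
```

A value of r close to 1.00 here says nothing about whether the groups differ in overall passing rates; it indicates only that the items line up in nearly the same order of difficulty for both groups.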
In general, cross-group comparisons of relative item difficulties for prominent aptitude and ability tests have yielded correlations bordering on 1.00; that is, most tests show extremely similar rank orderings for item difficulties across relevant subpopulations (Jensen, 1980 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib832) ; Reynolds, 1982). In a representative study, Miele (1979 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1143) ) investigated the relative item difficulties of the WISC for African American and White subjects at each of four grade levels (preschool, first, third, and fifth grades). He found that the average cross-racial correlations (holding grade level constant) for WISC item p values were .96 for males and .95 for females. These values were hardly different from the cross-sex correlations (holding grade level constant) within race, which were .98 (Whites) and .97 (African Americans). As noted, these findings are not unusual.
In general, for mainstream cognitive tests, the rank order of item difficulties is nearly identical for relevant subpopulations, including minority groups. However, some exceptions have been noted. For example, Urquhart-Hagie, Gallipo, and Svien (2003 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1680) ) report some striking examples of apparent item bias in a WISC-III study of 28 teenage children on the Lakota Sioux reservation in South Dakota. These authors computed the passing rates for the WISC-III subtest items and found dramatic deviations in the relative difficulty levels of consecutive items on a few of the subtests. For example, consider the Information subtest, which consists of 30 items ranked from very easy (nearly 100% passing rate) to very hard (less than 1% passing rate). These items evaluate the child’s fund of basic information, with questions on a par with “How many legs does a cat have?” (easy) or “Which continent includes Argentina?” (medium) or “Who is the Dalai Lama?” (hard). The problem noted by Urquhart-Hagie et al. (2003 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1680) ) on the Information subtest is that item 13 was passed at a substantially lower rate than expected. Specifically, the percentage of the sample passing items 11 through 15 did not show a smooth decline, as would be found in a nonbiased test:
Item Number    Percent Passing
Item 11        81
Item 12        61
Item 13        16
Item 14        45
Item 15        31
Item 13 reveals clear evidence of bias in construct validity—it is substantially more difficult than the preceding and following items. We cannot reveal the content of these copyrighted test items. However, we can say that item 13 requires the child to know about a well-known Italian explorer who reputedly discovered America. Actually, which foreigners first landed on American shores is an item of dispute—but that is another issue (Menzies, 2003 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1130) ). What is clear in this case is that item 13 on the WISC-III Information subtest requires knowledge that is unpalatable to most Native American examinees. The explorer in question is not a revered figure in this subculture. As Gregory (2009 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib651) ) notes:
We can well imagine the confusion of these indigenous people who have been on this continent for many thousands of years trying to fathom the notion that a European “discovered” their land.
In fairness, we should mention that clear examples of psychometrically confirmed test bias such as this are not common in published literature. Even so, this example serves as a reminder that ongoing investigations of test bias are still needed.
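The irregularity in the passing rates for items 11 through 15 can also be located mechanically. The sketch below uses the percentages reported above and applies a simple screening rule of our own choosing: flag any item that is passed less often than the item that follows it. This is not the procedure used by Urquhart-Hagie et al. (2003); it is only one plausible way to spot a break in an otherwise declining sequence.

```python
# Percent passing for WISC-III Information items 11 through 15, as reported above.
percent_passing = {11: 81, 12: 61, 13: 16, 14: 45, 15: 31}

items = sorted(percent_passing)

# Items are ordered from easier to harder, so each passing rate should be
# at least as high as the next one. Flag any item that breaks this pattern.
flagged = [
    item
    for item, next_item in zip(items, items[1:])
    if percent_passing[item] < percent_passing[next_item]
]

print(flagged)  # [13]
```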
Reprise on Test Bias
Critics who hypothesize that tests are biased against minorities assert that the test scores underestimate the ability of minority members. As we have argued in the preceding sections, the hypothesis of test bias is a scientific question that can be answered empirically through such procedures as factor analysis, regression equations, intergroup comparisons of the difficulty levels for “biased” versus “unbiased” items, and rank ordering of item difficulties. In general, most investigators have found by these criteria that major ability and aptitude tests lack bias (Jensen, 1980 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib832) ; Reynolds, 1994a; Kuncel & Sackett, 2007 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib929) ; Sackett, Borneman, & Connelly, 2008 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1427) ).
Recently, however, Aguinis, Culpepper, and Pierce (2010 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib11) ) have called into question the prevailing wisdom, using a complex statistical simulation to demonstrate that tests of bias are themselves biased. Their method, called Monte Carlo simulation, is beyond the scope of coverage here. They concluded that most studies of slope bias (rarely found in bias studies) do not possess sufficient statistical power to detect it. As noted earlier, slope bias results in the overprediction and underprediction of minority performance at different levels of the predictor variable. They also concluded that most findings of intercept bias (often found in bias studies, favoring minorities) are the result of a complex statistical artifact. Intercept bias is the systematic overprediction of scores for one group at all levels of the predictor variable. They conclude:
We are aware that we have set a tall-order goal of reviving research on test bias in pre-employment testing in the face of established conclusions in the fields of I/O psychology, management, and others concerned with high-stakes testing. Our results indicate that the accepted procedure to assess test bias is itself biased: Slope-based bias is likely to go undetected and intercept based bias favoring minority group members is likely to be found when in fact it does not exist (Aguinis et al., 2010 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib11) , p. 653).
The authors call for a renewal of interest in research on test bias in high-stakes testing and suggest methods to improve research in this area, including the use of power analysis to determine sample sizes needed for valid inferences about differential prediction.
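To give a sense of what a power analysis of this kind involves, the sketch below estimates by simulation how often a true difference in slopes would be detected at two sample sizes. It is not the procedure used by Aguinis et al. (2010). As a simplified stand-in, it fits a separate regression line in each group and compares the two slopes with a large-sample z test. The group sizes, the slopes of .5 and .3, and the number of replications are all hypothetical.

```python
import random
from math import sqrt


def slope_and_se(x, y):
    """Least-squares slope of y on x and its estimated standard error."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    return b, sqrt(sse / (n - 2) / sxx)


def simulate_power(n_majority, n_minority, slope_majority, slope_minority,
                   reps=300, seed=1):
    """Proportion of simulated studies in which the slope difference is detected."""
    rng = random.Random(seed)
    detections = 0
    for _ in range(reps):
        fits = []
        for n, slope in ((n_majority, slope_majority), (n_minority, slope_minority)):
            # Standardized predictor plus random error on the criterion.
            x = [rng.gauss(0, 1) for _ in range(n)]
            y = [slope * xi + rng.gauss(0, 1) for xi in x]
            fits.append(slope_and_se(x, y))
        (b1, se1), (b2, se2) = fits
        z = (b1 - b2) / sqrt(se1 ** 2 + se2 ** 2)
        if abs(z) > 1.96:  # two-tailed test at roughly the .05 level
            detections += 1
    return detections / reps


power_small = simulate_power(200, 30, 0.5, 0.3)
power_large = simulate_power(2000, 300, 0.5, 0.3)
print(f"estimated power, n = 200 and 30:   {power_small:.2f}")
print(f"estimated power, n = 2000 and 300: {power_large:.2f}")
```

Under these assumptions, the same true slope difference is missed in most of the smaller simulated studies and detected in most of the larger ones, which is the general point about statistical power.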
Analyses of test bias focus mainly on the statistical properties of selected instruments, looking for differential validity in the application of tests with minority examinees. But it is good to remember that potential bias does not reside solely within the qualities of the testing instrument. Bias can arise within the complexities of clinical interactions, especially when cultural differences exist between the practitioner and the client. The choice of a test and the timing of its application may impact the validity of the results, as we illustrate in Case Exhibit 6.1 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/ch06lev1sec7#ch06exh1) .
CASE EXHIBIT 6.1
The Impact of Culture on Testing Bias
The most commonly used tests of cognitive functioning come from the United States or western European nations. These instruments embody a Western perspective, with a focus on skills valued in urban and industrial settings (Poortinga & Van de Vijver, 2004 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1313) ). But culture impacts more than just test content; it also shapes our understanding of the assessment process itself. For example, most Westerners recognize that the purpose of consultation with a health care professional is to convey useful information to the practitioner. They know that the practitioner will conduct needed tests or procedures to help identify appropriate interventions. An implicit social contract guides the understanding of all parties.
But not every culture has the same understanding of this practitioner–patient covenant. We consider here the case of Mr. Kim, a 70-year-old man brought to a Latina psychologist by his daughter (Hayes, 2008 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib723) ). Mr. Kim was a second-generation Korean referred by his physician because of concerns about “memory loss.” The psychologist—we will call her Dr. Santiago—met initially with Mr. Kim and his adult daughter, Insook. The daughter seemed thoroughly acculturated to the United States, readily offering her thoughts. In contrast, Mr. Kim seemed more traditionally Korean, spoke rarely, and then in a low voice with a slight accent. He seldom made eye contact. Dr. Santiago made the cultural mistake of beginning the consultation by directing questions to Insook, an affront to Mr. Kim. In many Asian cultures, elderly persons expect to be treated with dignity and reverence, especially by their children (Kim, Kim, & Rue, 1997 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib890) ). The psychologist sensed that something was amiss, and switched to interviewing Mr. Kim directly. She asked if he experienced memory difficulties. He responded in a barely audible voice that he noticed “some” but that his daughter was “too bothered.”
At this point in the consultation, many psychologists would wonder if Mr. Kim was experiencing the onset of dementia. Typically, the practitioner might want to assess the mental status of the patient, perhaps using a test with good sensitivity and specificity like the Mini-Mental State Exam (Folstein, Folstein, & McHugh, 1975 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib520) ). This is a simple measure with 30 scorable items of orientation, memory, and other cognitive skills. It is so easy that normal adults score in the range of 27 to 30 points. But Dr. Santiago resisted the temptation to jump straight into testing, recognizing that Mr. Kim likely would be further alienated and perform poorly for cultural reasons, regardless of his cognitive status.
Instead of administering a test that would yield invalid and biased results, the psychologist chose to offer tea to Mr. Kim and his daughter. Afterward, she engaged Mr. Kim alone in a socially oriented conversation about his extended family, looking for signs of cognitive impairment such as word-finding problems, confusion, or difficulty staying on topic. Within this relaxed atmosphere, a better picture of his performance emerged. His cognitive slips were minor, yet his mood conveyed deep and abiding sadness. Dr. Santiago suspected that Mr. Kim suffered from depression, which can cause significant cognitive impairment, especially in the elderly (Reppermund, Brodaty, Crawford, and others, 2011 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1351) ). She offered no conclusions from this first consultation, but left the door open for further assessment of Mr. Kim. In the meantime, she planned to confer with an experienced Korean American psychologist.
An important lesson from this case is that the cultural background of the patient impacts the suitability, validity, and bias of assessment methods. An instrument appropriate in one context may yield invalid, biased results in a different cultural milieu. Hayes (2008 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib723) ) concluded that
the psychologist initially misinterpreted the father’s emotional restraint, lesser eye contact, and apparent acceptance of his difficulties as signs of dementia. She later learned that Mr. Kim’s demeanor is not uncommon among people of Korean and Buddhist cultures, for whom emotional restraint is often seen as a sign of maturity and problems are considered a fact of life (p. 145).
Sometimes choosing not to administer an ostensibly suitable test is the proper course of action, the necessary antidote to bias in testing.
We turn now to the broader concept of test fairness. How well do existing instruments meet reasonable criteria of test fairness? As the reader will learn, test fairness (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm01#bm01gloss329) involves social values and is, therefore, an altogether more debatable—and more debated—topic than test bias.
1 Contrary to widely held belief, test bias in these cases actually favors the lower-scoring group because its performance on the criterion is overpredicted. On occasion, then, test bias can favor minority groups.
6.8 SOCIAL VALUES AND TEST FAIRNESS
Even an unbiased test might still be deemed unfair because of the social consequences of using it for selection decisions. In contrast to the narrow, objective notion of test bias, the concept of test fairness incorporates social values and philosophies of test use. We will demonstrate to the reader that, in the final analysis, the proper application of psychological tests is essentially an ethical conclusion that cannot be established on objective grounds alone.
In a classic article that deserves detailed scrutiny, Hunter and Schmidt (1976 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib802) ) proposed the first clear distinction between statistical definitions of test bias and social conceptions of test fairness. Although the authors reviewed the usual technical criteria of test bias with incisive precision, their article is most famous for its description of three mutually incompatible ethical positions that can and should affect test use.
Hunter and Schmidt (1976 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib802) ) noted that psychological tests are often used for institutional selection procedures such as employment or college admission. In this context, the application of test results must be guided by a philosophy of selection. Unfortunately, in many institutions the selection philosophy is implicit, not explicit. Nonetheless, when underlying values are made explicit, three ethical positions can be distinguished. These positions are unqualified individualism, quotas, and qualified individualism. Since these ethical stances are at the very core of public concerns about test fairness, we will review these positions in some detail.
Unqualified Individualism
In the American tradition of free and open competition, the ethical stance of unqualified individualism (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm01#bm01gloss339) dictates that, without exception, the best qualified candidates should be selected for employment, admission, or other privilege. Hunter and Schmidt (1976 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib802) ) spell out the implications of this position:
Couched in the language of institutional selection procedures, this means that an organization should use whatever information it possesses to make a scientifically valid prediction of each individual’s performance and always select those with the highest predicted performance.
This position looks appealing at first glance, but embraces some implications that most persons find troublesome. In particular, if race, sex, or ethnic group membership contributed to valid prediction of performance in a given situation over and above the contributions of test scores, then those who espouse unqualified individualism would be ethically bound to use such a predictor.
Quotas
The ethical stance of quotas (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm01#bm01gloss264) acknowledges that many bureaucracies and educational institutions owe their very existence to the city or state in which they function. Since they exist at the will of the people, it can be argued that these institutions are ethically bound to act in a manner that is “politically appropriate” to their location. The logical consequence of this position is quotas. For example, in a location whose population is one-third African American and two-thirds White, selection procedures should admit candidates in approximately the same ratio. A selection procedure that deviates consistently from this standard would be considered unfair.
By definition, fair share quotas are based initially upon population percentages. Within relevant subpopulations, factors that predict future performance such as test scores would then be considered. However, one consequence of quotas is that those selected do not necessarily have the highest scores on the predictor test.
Qualified Individualism
Qualified individualism (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm01#bm01gloss263) is a radical variant of individualism:
This position notes that America is constitutionally opposed to discrimination on the basis of race, religion, national origin, or sex. A qualified individualist interprets this as an ethical imperative to refuse to use race, sex, and so on, as a predictor even if it were in fact scientifically valid to do so. (Hunter & Schmidt, 1976 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib802) )
For selection purposes, the qualified individualist would rely exclusively on tested abilities, without reference to age, sex, race, or other demographic characteristics. This seems laudable, but examine the potential consequences. Suppose a qualified individualist used SAT scores for purposes of college admission. Even though SAT scores for African Americans and Whites produce separate regression lines for the criterion of college grades, the qualified individualist would be ethically bound to use the single, less-accurate regression line derived for the entire sample of applicants. As a consequence, the future performance of African Americans would be overpredicted, which would seemingly boost the proportion of persons selected from this applicant group. With respect to selection ratios, the practical impact of qualified individualism is therefore midway between quotas and unqualified individualism.
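The overprediction that follows from a single pooled regression line can be shown with a small numerical sketch. The scores and grade averages below are invented for illustration, and the group names are placeholders; the only assumption carried over from the text is that the two groups have separate regression lines (here, the same slope with different intercepts).

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_error(xs, ys, slope, intercept):
    """Average of (actual - predicted); a negative value means overprediction."""
    return sum(y - (intercept + slope * x) for x, y in zip(xs, ys)) / len(xs)


# Hypothetical test scores (x) and grade averages (y).
group_a_x = [400, 500, 600, 700]
group_a_y = [2.1, 2.5, 2.9, 3.3]   # lies on y = 0.5 + 0.004x
group_b_x = [300, 400, 500, 600]
group_b_y = [1.4, 1.8, 2.2, 2.6]   # lies on y = 0.2 + 0.004x

# One regression line fitted to the entire sample of applicants.
pooled_slope, pooled_intercept = fit_line(
    group_a_x + group_b_x, group_a_y + group_b_y
)
error_a = mean_error(group_a_x, group_a_y, pooled_slope, pooled_intercept)
error_b = mean_error(group_b_x, group_b_y, pooled_slope, pooled_intercept)
print(round(error_a, 3), round(error_b, 3))
```

In this toy data set the pooled line underpredicts group A by about 0.125 grade points on average and overpredicts group B by the same amount. A selector who is limited to the pooled line therefore ranks group B applicants higher than their own regression line would justify.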
Reprise on Test Fairness Which philosophy of selection is correct? The truth is, this problem is beyond the scope of rational solution. At one time or another, each of the ethical stances outlined previously has been championed by wise, respected, and thoughtful citizens. However, no consensus has emerged, and one is not likely to be found soon.
The dispute reviewed here is typical of ethical arguments—the resolution depends in part on irreconcilable values. Furthermore, even among those who agree on values there will be disagreements about the validity of certain relevant scientific theories that are not yet adequately tested. Thus, we feel that there is no way that this dispute can be objectively resolved. Each person must choose as he sees fit (and in fact we are divided). (Hunter & Schmidt, 1976 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib802) )
When ethical stances clash—as they most certainly do in the application of psychological tests to selection decisions—the court system may become the final arbiter, as discussed later in this book.
6.9 GENETIC AND ENVIRONMENTAL DETERMINANTS OF INTELLIGENCE
Genetic Contributions to Intelligence The nature–nurture debate regarding intelligence is a well-known and overworked controversy that we will largely sidestep here. We concur with McGue, Bouchard, Iacono, and Lykken (1993 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1097) ) that a substantial genetic component to intelligence has been proved by decades of adoption studies, familial research, and twin projects, even though individual studies may be faulted for particular reasons:
When taken in aggregate, twin, family, and adoption studies of IQ provide a demonstration of the existence of genetic influences on IQ as good as can be achieved in the behavioral sciences with nonexperimental methods. Without positing the existence of genetic influences, it simply is not possible to give a credible account for the consistently greater IQ similarity among monozygotic (MZ) twins than among like-sex dizygotic (DZ) twins, the significant IQ correlations among biological relatives even when they are reared apart, and the strong association between the magnitude of the familial IQ correlation and the degree of genetic relatedness. (p. 60)
Of course, the demonstration of substantial genetic influence for a trait does not imply that heredity alone is responsible for differences between individuals—environmental factors are formative, too, as reviewed subsequently.
The genetic contribution to human characteristics such as intelligence (as measured by IQ tests) is usually measured in terms of a heritability index that can vary from 0.0 to 1.0. The heritability index (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm01#bm01gloss143) is an estimate of how much of the total variance in a given trait is due to genetic factors. Heritability of 0.0 means that genetic factors make no contribution to the variance in a trait, whereas heritability of 1.0 means that genetic factors are exclusively responsible for the variance in a trait. Of course, for most measurable characteristics, heritability is somewhere between the two extremes. McGue et al. (1993 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1097) ) discuss the various methods for computing heritability based on twin and adoption studies.
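One widely taught way to turn twin correlations into a heritability estimate is Falconer's formula, which doubles the difference between the identical-twin and fraternal-twin correlations. The sketch below is offered as an assumption-laden illustration, not as the procedure used by McGue et al.; the function name is hypothetical and the two correlations are illustrative inputs rather than figures taken from this chapter.

```python
def falconer_estimates(r_mz, r_dz):
    """Rough variance decomposition from twin correlations (Falconer's formula).

    r_mz: IQ correlation for identical (MZ) twins reared together
    r_dz: IQ correlation for fraternal (DZ) twins reared together
    Returns (heritability, shared environment, nonshared environment + error).
    """
    h2 = 2.0 * (r_mz - r_dz)   # genetic share of the variance
    c2 = 2.0 * r_dz - r_mz     # shared-environment share
    e2 = 1.0 - r_mz            # nonshared environment and measurement error
    return h2, c2, e2


# Illustrative correlations only (assumed, not reported in the text).
h2, c2, e2 = falconer_estimates(r_mz=0.86, r_dz=0.60)
print(round(h2, 2), round(c2, 2), round(e2, 2))
```

The three components sum to 1.0 by construction. The formula rests on simplifying assumptions (for example, that identical and fraternal twins share their environments to the same degree), so its output is best read as a rough estimate.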
It is important to stress that heritability is a population statistic that cannot be extended to explain an individual score. Furthermore, heritability for a given trait is not a constant. As Jensen (1969 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib829) ) notes, estimates of heritability “are specific to the population sampled, the point in time, how the measurements were made, and the particular test used to obtain the measurements.” For IQ, most studies report heritability estimates right around .50, meaning that about half of the variability in IQ scores is from genetic factors. For some studies, the heritability of IQ is much higher, in the .70s (Bouchard, 1994 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib186) ; Bouchard, Lykken, McGue, Segal, & Tellegen, 1990 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib187) ; Pedersen, Plomin, Nesselroade, & McClearn, 1992 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1277) ).
Yet, the heritability of IQ defies any simple summary. For one thing, genetic influence on IQ appears to demonstrate an interaction effect with socioeconomic status (SES). Turkheimer et al. (2003 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1672) ) studied IQ results for 7-year-old twins, many living at or below the poverty level, others reared in middle class or higher
families. The proportion of variance in IQ accounted for by genetic factors was inferred from the similarities/differences in IQ scores of identical versus fraternal twins. For families with the lowest levels of SES, environmental factors accounted for almost all of the variation in IQ. But in families with the highest levels of SES (middle and upper class), genetic factors accounted for almost all of the variation in IQ. These striking results have been only partially confirmed by other twin studies. The interaction effect is minimal in studies conducted in other countries (Nisbett et al., 2012 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1239) ).
If genuine, as appears to be the case in the United States, the interaction between SES and heritability, with IQ revealing little genetic influence for low SES children, carries important policy implications:
One interpretation of the finding that heritability of IQ is very low for lower SES individuals is that children in poverty do not get to develop their full genetic potential. If true, there is room for interventions with that group to have large effects on IQ (Nisbett et al., 2012 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1239) , p. 134).
We investigate the impact of enriched environments such as early educational intervention in a later topic.
A most fascinating demonstration of the genetic contribution to IQ is found in the Minnesota Study of Twins Reared Apart (Segal, 2012 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1479) ). In this ongoing study, identical twins reared apart are reunited for extensive psychometric testing. Bouchard (1994 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib186) ) reports that the IQs of identical twins reared apart correlate almost as highly as those of identical twins reared together, even though the twins reared apart often were exposed to different environmental conditions (in some cases, sharply contrasting environments). In sum, differences in environment appeared to cause very little divergence in the IQs of identical twin pairs reared apart. These findings strongly suggest a genetic contribution to intelligence, with heritability estimated in the vicinity of .70.
The Minnesota Study and other twin studies have been criticized on methodological and philosophical grounds. Methodologically, one concern is that identical twins separated early in life for adoption might be placed in highly similar environments, which would inflate the estimated genetic influence when the twins are reunited and tested in adulthood. Philosophically, some skeptics question the utility and purpose of churning out one heritability estimate after another:
It is not apparent what scientific purposes are served by the sustained flow of heritability numbers for psychological characteristics. Perhaps molecular geneticists need those numbers to guide their search for the underlying genes? Perhaps clinical psychologists need those numbers to guide their selection of therapies that work? Or perhaps educators need those numbers to guide their choice of teaching interventions that will be successful? We have seen no indication of the usefulness of the heritability numbers for any of those purposes. Indeed, it has been widely recognized that malleability is not the opposite of heritability. (Kamin & Goldberger, 2001 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib853) , p. 28)
In sum, traits with high heritability might still prove to be malleable in the face of environmental factors. If this is so, what constructive purpose is served by the flood of heritability estimates found in the research literature?
Thus, we must avoid the tendency to view any corpus of research in a simplistic either/or frame of mind. Even the most diehard hereditarians acknowledge that a person’s intelligence is shaped also by the quality
of experience. The crucial question is: To what extent can enriched or deprived environments modify intelligence upward or downward from the genetically circumscribed potential? The reader is reminded that the genetic contribution to intelligence is indirect, most likely via the gene-coded physical structures of the brain and nervous system. Nonetheless, the brain is quite malleable in the face of environmental manipulations, which can even alter its weight and the richness of neuronal networks (Greenough, Black, & Wallace, 1987 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib644) ). How much can such environmental impacts sway intelligence as measured by IQ tests? We will review several studies indicating that environmental extremes help determine intellectual outcome within a range of approximately 20 IQ points, perhaps more.
Environmental Effects: Impoverishment and Enrichment First, we examine the effects of environmental disadvantage. Vernon (1979 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1698) , chap. 9) has reviewed the early studies of severe deprivation, noting that children reared under conditions in which they received little or no human contact can show striking improvements in IQ (as much as 30 to 50 points) when transferred to a more normal environment. Yet, we must regard this body of research with some skepticism, owing to the typically exceptional conditions under which the initial tests were administered. Can a meaningful test be administered to 7-year-old children raised almost like animals (Koluchova, 1972)?
Typical of this early research is the follow-up study by Skeels (1966) of 25 orphaned children originally diagnosed as having mental retardation (Skeels & Dye, 1939). These children were first tested at approximately 1½ years of age when living in a highly unstimulating orphanage. Thirteen of them were then transferred to another home where they received a great deal of supervised, doting attention from older girls with mental retardation. These children showed a considerable increase in IQ, whereas the 12 who remained behind decreased further in IQ. When traced at follow-up 26 years later, the 13 transferred cases were normal, self-supporting adults, or were married. The other subjects (the contrast group) were still institutionalized or in menial jobs. The enriched group showed an average increase of 32 IQ points when retested with the Stanford-Binet, whereas the contrast group fell below their original scores. Even though we are disinclined to place much credence in the original IQ scores and might, therefore, quarrel with the exact magnitude of the change, the Skeels (1966) study surely indicates that the difference between a severely depriving early environment and a more normal one might account for perhaps 15 to 20 IQ points.
More recently, Breslau, Chilcoat, Susser, and others (2001 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib216) ) conducted a rigorous longitudinal study that illustrates the detrimental impact of growing up in a racially segregated and economically disadvantaged community. Using the WISC-R, they collected longitudinal IQ scores at age 6 and age 11 for large samples of urban and suburban children, some low birth weight (≤2500 grams) and some normal birth weight (>2500 grams). The urban samples were primarily Black, from inner city Detroit, and reared by a single mother with high school (or less) education. These children typically experienced economic deprivation, inferior education, family stress, and racial segregation. The suburban samples were primarily White, from economically advantaged communities, and reared by a married mother with college education. As the authors note, “the sampling design provided for a comparison of populations with starkly contrasting social conditions.” (p. 712)
FIGURE 6.11 Average IQ Scores for Urban and Suburban Children at Age 6 and Age 11
S-N: Suburban Normal Birth Weight
S-L: Suburban Low Birth Weight
U-N: Urban Normal Birth Weight
U-L: Urban Low Birth Weight
Source: Based on data in Breslau, N., Chilcoat, H., Susser, E., and others (2001 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib216) ). Stability and change in children’s Intelligence Quotient scores: A comparison of two socioeconomically disparate communities. American Journal of Epidemiology, 154, 711–717.
The mean IQ scores for all samples at both times of testing (age 6 and age 11) are depicted in Figure 6.11 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/ch06lev1sec9#ch06fig11) . The reader will observe that suburban samples scored higher than inner city samples, and that normal birth weight children scored higher than low-birth-weight children. These results are not especially remarkable—the negative impacts of low birth weight and economic disadvantage are well documented in the literature on group differences in IQ outcomes (e.g., Breslau, 1994 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib215) ; Ceci, 1996 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib288) ). What is noteworthy about the results—one might even say astonishing—is that both of the inner city samples (low birth weight and normal birth weight) apparently lost an average of 5 IQ points during the five years between initial testing at age 6 and follow-up testing at age 11. In contrast, the suburban samples held constant in IQ during the same time period. It is difficult to conceive a benign explanation for these findings. Apparently, growing up in the poverty, segregation, and turmoil of the inner city imposes
hardships that lead to a decline in IQ scores from age 6 to age 11. The authors summarize the significance of their study as follows:
On average, the IQs of urban children declined by more than 5 points. A change of 5 points in an individual child might be judged by some as clinically nonsignificant. Nevertheless, a change of this size in a population’s mean IQ, which reflects a downward shift in the distribution (rather than a change in the shape of the distribution), means that the proportion of children scoring 1 standard deviation or more below the standardized IQ mean of 100 would increase substantially. In this study, the change from age 6 to age 11 years increased the percentage of urban children scoring less than 85 on the WISC-R from 22.2 to 33.2. (Breslau et al., 2001 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib216) , p. 716)
Sadly, the apparent drop of 5 points in average IQ from age 6 to age 11 found in this study may represent only part of the overall impact of environmental deprivation. The full effect over a lifetime could be substantially greater.
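The point made by Breslau et al., that a modest downward shift in the mean produces a large increase in the share of low scores, can be checked with a short calculation. The sketch below assumes that scores are normally distributed with the conventional standard deviation of 15 and that the shape of the distribution is unchanged. The normality assumption and the function names are ours, not the authors'.

```python
from statistics import NormalDist

SD = 15.0        # conventional IQ standard deviation
CUTOFF = 85.0    # 1 SD below the standardized mean of 100


def share_below(mean, cutoff=CUTOFF, sd=SD):
    """Proportion of a normal distribution with this mean that falls below cutoff."""
    return NormalDist(mu=mean, sigma=sd).cdf(cutoff)


def mean_for_share(share, cutoff=CUTOFF, sd=SD):
    """Mean that would place the given share of scores below cutoff."""
    return cutoff - NormalDist().inv_cdf(share) * sd


# Mean implied by 22.2 percent of children scoring below 85 (assumed normal).
start_mean = mean_for_share(0.222)
# Same distribution shape with the mean shifted down by 5 points.
after_shift = share_below(start_mean - 5.0)
print(round(start_mean, 1), round(after_shift, 3))
```

Under these assumptions, a 5-point drop raises the share below 85 from 22.2 percent to roughly one third, close to the 33.2 percent reported in the study. The same function shows that a population centered on 100 would see its share below 85 rise from about 16 percent to about 25 percent after an identical shift.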
Jensen (1977 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib830) ) found similar results in a methodologically novel study of severely impoverished African American children in rural Georgia. Comparing older and younger siblings on the California Test of Mental Maturity (CTMM), he found that children from this setting, which was “as severely disadvantaged, educationally and economically, as can be found anywhere in the United States,” appeared to lose up to one IQ point a year, on average, between the ages of 6 and 16. The cumulative loss totaled 5 to 10 IQ points. Furthermore, if we factor in the probable IQ deficit that occurred between birth and age 5, we can surmise that the overall effect of a depriving environment is probably more than the 5- to 10-point IQ decrement reported by Jensen (1977 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib830) ).
Scarr and Weinberg (1976 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1445) , 1983 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1446) ) reversed the question probed by Jensen (1977 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib830) ), namely, they asked: What happens to their intelligence when African American children are adopted into the relatively enriched environment provided by economically and educationally advantaged White families? As discussed later, it is well known that African American children reared by their own families obtain IQ scores that average about 15 points below Whites (Jensen, 1980 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib832) ). Some portion of this difference—perhaps all of it—is likely due to the many social, economic, and cultural differences between the two groups. We put that issue aside for now. Instead, we pursue a related question that bears on the malleability of IQ: What difference does it make when African American children are adopted into a more economically and educationally advantaged environment?
Scarr and Weinberg (1976 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1445) , 1983 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1446) ) found that 130 African American and interracial children adopted into upper-middle-class White families averaged a Full Scale IQ of 106 on the Stanford-Binet or the WISC, a full 6 points higher than the national average and some 18 to 21 points higher than typically found with African American examinees. African American children adopted early in life, before 1 year of age, fared even better, with a mean IQ of 110. We can only wonder what the IQ scores would have been if the adoptions had taken place at birth and if excellent
prenatal care had been provided. This study indicates that when the early environment is optimal, IQ can be boosted by perhaps 20 points.
Limitations of space prevent us from further detailed discussion of environmental effects on IQ. It is worth noting, though, that a huge literature has emerged from early intervention and enrichment-stimulation studies of children at risk for school failure and mental retardation (e.g., Barnett & Camilli, 2002 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib92) ; Ramey & Ramey, 1998 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1330) ). In general, these studies show that intervention and enrichment can boost IQ in children at risk for school failure and mental retardation. Summarizing four decades of research, Ramey and Ramey (1998 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1330) ) extracted six principles from the research on early intervention for at-risk children. They refer to these as “remarkable consistencies in the major findings” on intervention studies:
Interventions that begin earlier (e.g., during infancy) and continue longer provide the best benefits to participating children.
More-intensive interventions (e.g., number of visits per week) produce larger positive effects than less-intensive interventions.
Direct enrichment experiences (e.g., working directly with the kids) provide greater impact than indirect experiences.
Programs with comprehensive services (e.g., multiple enhancements) produce greater positive changes than those with a narrow focus.
Some children (e.g., those with normal birth weight) show greater benefits from participation than other children.
Initial positive benefits diminish over time if the child’s environment does not encourage positive attitudes and continued learning.
One concern about early intervention programs is their cost, which has been excessive for some of the demonstration projects. Skeptics wonder about the practicality and also the ultimate payoff of providing extensive, broad-based, continuing intervention virtually from birth onward for the millions of children at risk for developmental problems. This is a realistic concern because “relatively few early intervention programs have received long-term follow-up” (Ramey & Ramey, 1998 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1330) ). Critics also wonder if the programs merely teach children how to take tests without affecting their underlying intelligence very much (Jensen, 1981 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib833) ). Finally, there is the issue of cultural congruence. Intervention programs are mainly designed by White psychologists and then applied disproportionately to minority children. This is a concern because programs need to be culturally relevant and welcomed by the consumers; otherwise, the interventions are doomed to failure.
One popular intervention program is Head Start, created in 1965 and funded continuously by the federal government. The original program provided comprehensive services for children 3 to 5 years of age. In 1995, with the inception of Early Head Start under President Clinton, coverage was expanded to children from birth to 5 years of age. In 2012, funding for Head Start was approximately $8 billion. These funds provided a broad range of services including preschool education centers for low-income families, child care homes, medical and dental services, and home-based consultation by developmental experts. Over one million infants and children receive Head Start services each year. Low-income pregnant women also are eligible for services. Interventions are designed to be culturally sensitive and involve the parents as
much as possible. School readiness is the overriding goal, which is facilitated through the support of cognitive, language, physical, social, and emotional development.
Zhai, Brooks-Gunn, and Waldfogel (2011 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1803) ) recently completed a study of school readiness in 2,803 Head Start children from 18 cities. When compared with children from any other child care arrangement, children in Head Start demonstrated, at age 5, gains in cognitive development as measured by the Peabody Picture Vocabulary Test-III and a letter-word identification task, improvements in social competence as measured by a subscale from the Adaptive Social Behavior Inventory (Hogan, Scott, & Bauer, 1992 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib755) ), and reductions in their attention problems as measured by a subscale from the Child Behavior Checklist (CBCL, Achenbach & Rescorla, 2000 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib07) ). There were no statistically significant effects on internalizing or externalizing behavior problems on the CBCL. The researchers emphasize that Head Start impacts more than cognitive development. It also enhances attentional and emotional skills essential for school readiness.
Teratogenic Effects on Intelligence and Development In normal prenatal development, the fetus is protected from the external environment by the placenta, a vascular organ in the uterus through which the fetus is nourished. However, some substances known as teratogens (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm01#bm01gloss325) cross the placental barrier and cause physical deformities in the fetus. Especially if the deformities involve the brain, teratogens may produce lifelong behavioral disorders, including low IQ and mental retardation. The list of potential teratogens is almost endless and includes prescription drugs, hormones, illicit drugs, smoking, alcohol, radiation, toxic chemicals, and viral infections (Berk, 1989; Martin, 1994 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1047) ). We will briefly highlight the most prevalent and also the most preventable teratogen of all, alcohol.
Heavy drinking by pregnant women causes their offspring to be at very high risk for fetal alcohol syndrome (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm01#bm01gloss120) (FAS), a specific cluster of abnormalities first described by Jones, Smith, Ulleland, and Streissguth (1973 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib847) ). Intelligence is markedly lower in children with FAS. When assessed in adolescence or adulthood, about half of all persons with this disorder score in the range of mental retardation on IQ tests (Olson, 1994 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1254) ). Prenatal exposure to alcohol is one of the leading known causes of mental retardation in the Western world. The defining criteria of FAS include the following:
Prenatal and/or postnatal growth retardation: weight below the tenth percentile after correcting for gestational age
Central nervous system dysfunction: skull or brain malformations, mild to moderate mental retardation, neurological abnormalities, and behavior problems
Facial dysmorphology: widely spaced eyes, short eyelid openings, small up-turned nose, thin upper lip, and minor ear deformities (Sokol & Clarren, 1989 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1521) )
Full-blown FAS occurs mainly in the offspring of women alcoholics, that is, those who ingest many drinks per occasion.
Children exposed to lesser levels of alcohol during pregnancy may manifest a range of consequences known collectively as Fetal Alcohol Spectrum Disorder (FASD) (Bertrand, Floyd, Weber, and others, 2004 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib155) ). FASD is an unofficial umbrella term that encompasses the entire range of adverse consequences. These outcomes include full-blown FAS, the most devastating result of prenatal exposure to alcohol, and other manifestations referred to with terms such as fetal alcohol effect, alcohol-related neurodevelopmental disorder, and similar designations. Even though the existence of adverse effects from prenatal exposure to low or moderate drinking is still disputed (Abel, 2009 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib03) ), the best advice to pregnant women is to refrain entirely from alcohol. A child with FASD might function in the borderline range of intelligence and manifest poor coordination, difficulty with concept formation, hyperactivity, and problems with executive functions. In the absence of intervention, the consequences to the child, the family, and society are profound, as confirmed by Streissguth, Bookstein, Barr, and others (2004 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1590) ). They studied 415 children and adults with confirmed FASD, searching patient records and interviewing knowledgeable informants. The median IQ of the group was 86, with a range of 29 to 126. Most were young (median age of 14, range 6 to 51), but many had reached adolescence and adulthood. For these older individuals, 60 percent had experienced trouble with the law, 50 percent had been in a jail, prison, or inpatient setting, 49 percent had engaged in inappropriate sexual behaviors, and 35 percent experienced alcohol or drug problems. 
In spite of these markers of turmoil and social disruption, early diagnosis of FASD and placement in a stable environment dramatically reduced the likelihood of these adverse outcomes.
FASD likely is more common than previously thought. According to the Centers for Disease Control and Prevention (CDC, 2012 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib290) ), 7.6 percent of pregnant women report using alcohol, including 1.4 percent who engage in binge drinking (6 or more drinks per occasion). These data probably underestimate alcohol intake during pregnancy, because some women will be reluctant to report honestly on their drinking. Clearly, a small proportion of pregnant women continue to drink, in spite of widespread public health warnings. As a result, FASD persists as a public health problem.
Many affected children do not show the characteristic facial anomalies and therefore never receive proper diagnosis and early intervention. In a thorough study of elementary school children in two counties in Washington State, Clarren, Randels, Sanderson, and Fineman (2001 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib310) ) found that only 1 in 7 children with FAS had been previously diagnosed. Based on epidemiological findings and the convergence of evidence from several research methods, May, Gossage, Kalberg, and others (2009 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1069) ) concluded that the current prevalence of FASD among younger school children may be as high as 2 to 5 percent in the United States and some western European countries. The social, health, and economic consequences of these estimated prevalence rates are cause for concern.
Effects of Environmental Toxins on Intelligence
Many industrial chemicals and by-products may impair the nervous system temporarily, or even cause permanent damage that affects intelligence. Examples include lead, mercury, manganese, arsenic, thallium, tetra-ethyl lead, organic mercury compounds, methyl bromide, and carbon disulphide (Lishman, 1997 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib999) ). Long-term exposure to organophosphate pesticides such as encountered by some farm workers is known to cause neurobehavioral deficits in memory, fine motor control, response speed, and mental flexibility
(Mackenzie Ross, Brewin, & Curran, 2010 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1028) ; Roldán-Tapia, Parrón, & Sánchez-Santed, 2005 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1403) ). Certainly, the most widely studied of these environmental toxins is lead, which we examine in modest detail here.
Sources of human lead absorption include eating of lead-pigmented paint chips by infants and toddlers; breathing of particulate lead from smelter emissions; eating of food from lead-soldered cans or lead-glazed pottery; and the drinking of water that has passed through lead pipes. Because the human body excretes lead slowly, most citizens of the industrialized world carry a lead burden substantially higher, perhaps 500 times higher, than known in pre-Roman times (Patterson, 1980 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1269) ).
The hazards of high-level lead exposure are acknowledged by every medical and psychological researcher who has investigated this topic. High doses of lead are irrefutably linked to cerebral palsy, seizure disorders, blindness, mental retardation, even death. The more important question pertains to “asymptomatic” lead exposure: Can a level of absorption that is insufficient to cause obvious medical symptoms nonetheless produce a decrement in intellectual abilities?
Research findings on this topic are complex and controversial. Using tooth lead from shed teeth of young children as their index of cumulative lead burden, Needleman and associates (1979 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1220) ) reported that “asymptomatic” lead exposure was associated with decrements in overall intelligence (about 4 IQ points) and lowered performance on verbal subtests, auditory and speech processing tests, and a reaction time measure of attention. These differences persisted at follow-up 11 years later (Needleman, Schell, Bellinger, Leviton, & Allred, 1990 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1221) ). Yet, using a similar study method, Smith, Delves, Lansdown, Clayton, and Graham (1983 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1512) ) found a nonsignificant effect from children’s lead exposure when social factors such as the parents’ level of education and social status were controlled.
In part, research findings on this topic are contradictory because it is difficult to disentangle the effects of lead from those of poverty, stress, poor nutrition, and other confounding variables (Kaufman, 2001a, b). Most likely, asymptomatic lead exposure has harmful effects on the nervous system that translate to reduced intelligence, impaired attention, and a host of other undesirable behavioral consequences.
Recent studies continue to raise alarm about the impact of very low levels of lead exposure on the behavioral and neurocognitive functioning of children. Marcus et al. (2010 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1041) ) completed a meta-analysis of 19 studies on lead (from hair samples) and behavior problems in 8,561 children. The average correlation across all studies was r = .19 (p < .001), that is, the higher the lead level, the greater the severity of conduct problems. Strayhorn and Strayhorn (2012 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1587) ) studied achievement scores in relation to elevated blood lead levels in children for the 57 counties of New York State, using family income as a covariate. Achievement scores were taken from state-wide English and mathematics testing conducted in the third and eighth grades. The partial correlations between incidence of elevated lead and number of children in the lowest achievement levels ranged from .29 to .40 (p < .05). The researchers found a direct linear relationship: for each one percent increase in children with lead
levels elevated beyond the official CDC limit, there was a corresponding one percent increase in children in the lowest achievement group.
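The partial correlations reported above describe the lead and achievement relationship with family income held constant. As a minimal sketch of what that statistic does, the first-order partial correlation can be computed from three ordinary correlations. The three input values below are invented for illustration and are not taken from Strayhorn and Strayhorn (2012); the function and variable names are likewise hypothetical.

```python
import math


def partial_correlation(r_xy: float, r_xz: float, r_yz: float) -> float:
    """First-order partial correlation between x and y with z held constant."""
    denominator = math.sqrt((1.0 - r_xz ** 2) * (1.0 - r_yz ** 2))
    return (r_xy - r_xz * r_yz) / denominator


# Hypothetical county-level correlations, invented for illustration only.
r_lead_lowach = 0.50     # elevated lead vs. share of children in the lowest achievement level
r_lead_income = -0.40    # elevated lead vs. family income
r_lowach_income = -0.45  # lowest achievement level vs. family income

r_partial = partial_correlation(r_lead_lowach, r_lead_income, r_lowach_income)
print(f"zero-order r = {r_lead_lowach:.2f}, partial r = {r_partial:.2f}")
```

With these made-up inputs the association shrinks from .50 to roughly .39 once income is partialed out, which illustrates how a covariate can absorb part, but not all, of a raw correlation.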
These recent studies probably help explain why the CDC recently lowered the level of acceptable blood lead burden from 10 to 5 μg/dL, the first change in 20 years (New York Times, May 17, 2012, CDC lowers recommended lead-level limits in children). The current level, 5 μg/dL, is an exceedingly small level of exposure. One μg (microgram) is one-millionth of a gram, and a dL (deciliter) is one-tenth of a liter or almost half a cup.
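To make the size of 5 μg/dL concrete, the unit arithmetic in the paragraph above can be sketched as follows. The cup size is an assumption (a US customary cup of about 236.6 milliliters), and the constant and function names are illustrative only.

```python
MICROGRAMS_PER_GRAM = 1_000_000
DECILITERS_PER_LITER = 10
MILLILITERS_PER_DECILITER = 100
MILLILITERS_PER_US_CUP = 236.588  # assumption: US customary cup


def ug_per_dl_to_g_per_l(concentration_ug_dl: float) -> float:
    """Convert micrograms per deciliter to grams per liter."""
    return concentration_ug_dl * DECILITERS_PER_LITER / MICROGRAMS_PER_GRAM


cdc_level_g_per_l = ug_per_dl_to_g_per_l(5.0)
deciliter_in_cups = MILLILITERS_PER_DECILITER / MILLILITERS_PER_US_CUP

print(f"5 ug/dL = {cdc_level_g_per_l:.5f} g per liter of blood")
print(f"1 dL = {deciliter_in_cups:.2f} US cups")
```

Under these assumptions the reference level works out to 50 millionths of a gram of lead per liter of blood, and a deciliter is a little over four-tenths of a cup.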
In addition to the health burden from lead exposure, the overall national costs are substantial, as outlined in a recent social policy report from the Society for Research in Child Development (SRCD):
Children’s exposure to lead is expensive, incurring costs associated with health care and losses associated with lowered intellectual development, earnings, and tax contributions. One study put the overall cost of exposure in children 6 and under at $192 to $270 billion over six years. Another cost analysis concluded that reducing children’s blood lead levels just 1 μg/dL would save $7.56 billion annually (SRCD, 2010 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1520) , p. 2).
Prudence dictates that we should reduce lead exposure in humans to the lowest levels possible.
6.10 ORIGINS AND TRENDS IN RACIAL IQ DIFFERENCES
Early Studies of African American and White IQ Differences
Racial differences in IQ have been recorded since the beginnings of standardized testing. The most widely studied disparity is between African American and White samples, where a discrepancy favoring Whites of about one standard deviation (15 points) is historically reported. We should add that the term Black is used interchangeably with African American, and that White refers to non-Hispanic White individuals. The IQ difference fluctuates from one analysis to the next, from as small as 10 points in a few studies to as large as 20 points in others. For example, in the 1960 restandardization of the Stanford-Binet, the White sample (M = 101.8) outscored the Black sample (M = 80.7) by slightly more than 20 IQ points (Kennedy, Van de Riet, & White, 1963 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib878) ). A lesser difference was revealed on the 1981 WAIS-R where Whites (M = 101.4) outscored Blacks (M = 86.9) by 14½ points (Reynolds, Chastain, Kaufman, & McLean, 1987 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1362) ). In the standardization sample for the fourth edition of the Stanford-Binet (Thorndike, Hagen, & Sattler, 1986 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1642) ) a difference of about 17½ points (mean of 103.5 versus 86.1) was observed. For these early studies, when demographic variables such as socioeconomic status are taken into account, the size of the mean difference reduces to .5 to .7 standard deviations (7 to 10 IQ points) but does not disappear (Reynolds & Brown, 1984a (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1360) ). Put simply, the existence of race differences in IQ has been reported with such consistency that it is no longer the focus of serious dispute.
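The point differences quoted above can be re-expressed in standard deviation units, which is how the adjusted figure of .5 to .7 standard deviations is stated. The sketch below simply redoes that arithmetic from the means reported in the text. It assumes a standard deviation of 15 points throughout, following the text's "one standard deviation (15 points)"; individual instruments may use a different value, and the names are illustrative.

```python
IQ_SD = 15.0  # assumption: the 15-point standard deviation used in the text

# Pairs of sample means exactly as reported in the paragraph above.
REPORTED_MEANS = {
    "Stanford-Binet, 1960 restandardization": (101.8, 80.7),
    "WAIS-R, 1981": (101.4, 86.9),
    "Stanford-Binet, fourth edition": (103.5, 86.1),
}


def mean_gap(mean_a: float, mean_b: float, sd: float = IQ_SD) -> tuple[float, float]:
    """Return the difference between two means in points and in SD units."""
    points = mean_a - mean_b
    return points, points / sd


for study, (mean_a, mean_b) in REPORTED_MEANS.items():
    points, sd_units = mean_gap(mean_a, mean_b)
    print(f"{study}: {points:.1f} points = {sd_units:.2f} SD")
```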
However, the interpretation of race differences in IQ is an issue of fierce ongoing debate. Why the disparity exists, what it means from a practical standpoint, and whether the gap is narrowing—all these topics engender a full range of opinions (Fagan & Holland, 2007; Rushton & Jensen, 2005 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1418) ). We begin our discussion with the question of origins—what are the causes of the Black-White IQ difference?
One viewpoint (discussed previously) is that the observed IQ disparity is caused, partly or wholly, by test bias. This viewpoint is popular and widely held, but technical studies of test bias rarely support it. Test bias may play a small role in race differences, but it cannot explain the persistent difference in IQ scores between Black and White Americans. Here we examine a different hypothesis, namely: Is the IQ difference between Black and White Americans due, in significant part, to genetic sources?
The Genetic Hypothesis for Race Differences in IQ
The hypothesis of a genetic basis for race differences in IQ first gained scholarly prominence in 1969 when Arthur Jensen published a provocative paper titled “How Much Can We Boost IQ and Scholastic Achievement?” (Jensen, 1969 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib829) ). Jensen set the tone for his paper in the opening sentence when he asserted that “compensatory education has been tried and it apparently has failed.” He further contended that compensatory education programs were based on two fallacious theoretical underpinnings, namely, the “average child concept,” which views children as more or less homogeneous, and the “social deprivation hypothesis,” which asserts that environmental deprivation is the primary cause of lowered achievement and IQ scores. Jensen argued forcefully against both suppositions. Furthermore, leaning heavily on the literature in behavior genetics, Jensen implied that the
reason Whites scored higher than African Americans on IQ tests was probably related more to genetic factors than to the effects of environmental deprivation. The thrust of his paper was to suggest that, since compensatory education has proved ineffectual, and since the evidence suggests a strong genetic component to IQ, therefore, it is appropriate to entertain a genetic explanation for the well-documented difference in favor of Whites on IQ tests. He formulated the genetic hypothesis in a careful, tentative, scholarly manner:
The fact that a reasonable hypothesis has not been rigorously proved does not mean that it should be summarily dismissed. It only means that we need more appropriate research for putting it to the test. I believe such definitive research is entirely possible but has not been done. So all we are left with are various lines of evidence, no one of which is definitive alone, but which, viewed all together, make it a not unreasonable hypothesis that genetic factors are strongly implicated in the average Negro-white intelligence difference. The preponderance of the evidence is, in my opinion, less consistent with a strictly environmental hypothesis than with a genetic hypothesis, which, of course, does not exclude the influence of environment or its interaction with genetic factors. (Jensen, 1969 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib829) )
With the articulation of a genetic hypothesis for race differences in IQ, Jensen provoked an intense debate that has raged on, with periodic lulls, to the present day.
In the mid-1990s the controversy over a genetic basis for race differences in IQ was intensified once again with the publication of The Bell Curve by Richard Herrnstein and Charles Murray (1994 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib738) ). This massive tome was primarily a book about the importance of IQ as a predictor of poverty, school leaving, unemployment, illegitimacy, crime, and a host of other social pathologies. But two chapters on ethnic differences in intelligence caused an uproar among social scientists and the lay public. The authors reviewed dozens of studies and concluded that the IQ gap between African Americans and Whites has changed little in this century. They also argued that test bias cannot explain the race differences. Furthermore, they noted that races differ not just in average IQ scores but also in the profile of intellectual abilities. In addition, they concluded that intelligence is only slightly malleable even in the face of intensive environmental intervention. As did Jensen, Herrnstein and Murray (1994 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib738) ) stated their genetic hypothesis with considerable circumspection:
It seems highly likely to us that both genes and the environment have something to do with racial differences [in cognitive ability]. What might the mix be? We are resolutely agnostic on that issue; as far as we can determine, the evidence does not yet justify an estimate.
Although the authors declined to provide an estimate of the genetic contribution to race differences in IQ, it is clear from the tone of their pessimistic book that they believe it to be substantial. Recently, Arthur Jensen has reentered the debate on the origins of IQ differences between African Americans and White Americans and reaffirmed his earlier judgment that the disparity is “partly heritable” (Rushton & Jensen, 2005 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1418) ). Is this conclusion warranted by the evidence?
Tenability of the Genetic Hypothesis
The genetic hypothesis for race IQ differences is an unpopular idea that is anathema to many laypersons and social scientists. But contempt for an idea does not constitute disproof, and superficiality is no substitute for a reasoned examination of evidence. In light of additional analysis and research, is the
genetic hypothesis for IQ differences tenable? We will examine three lines of evidence here that indicate that the answer is “no.”
Several critics have pointed out that the genetic hypothesis is based on the questionable assumption that evidence of IQ heritability within groups can be used to infer heritability between racial groups. Jensen (1969 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib829) ) expressed this premise rather explicitly, pointing to the substantial genetic component in IQ as suggestive evidence that differences in IQ between African Americans and White Americans are, in part, genetically based. Echoing earlier critics, Kaufman (1990 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib862) ) responds as follows:
One cannot infer heritability between groups from studies that have provided evidence of the IQ’s heritability within groups. Even if IQ is equally heritable within the black and white races separately, that does not prove that the IQ differences between the races are genetic in origin. Scarr-Salapatek’s (1971 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1447) , p. 1226) simple example explains this point well: Plant two randomly drawn samples of seeds from a genetically heterogeneous population in two types of soil—good conditions versus poor conditions—and compare the heights of the fully grown plants. Within each type of soil, individual variations in the heights are genetically determined; but the average difference in height between the two samples is solely a function of environment.
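The seed example lends itself to a small simulation. The sketch below is only an illustration under stated assumptions: genetic values for both samples are drawn from the same normal distribution, each soil adds a fixed amount to every plant, and all numbers (means, spread, soil effects, sample size) are invented. It is not taken from Scarr-Salapatek (1971).

```python
import random
import statistics


def pearson_r(xs, ys):
    """Pearson correlation computed from deviations about the means."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(0)   # fixed seed so the sketch is reproducible
N_SEEDS = 1000           # hypothetical sample size per soil type
GOOD_SOIL = 0.0          # hypothetical environmental effect on height
POOR_SOIL = -15.0        # hypothetical environmental effect on height


def grow(soil_effect):
    """Draw seeds from one shared gene pool and grow them in one soil."""
    genotypes = [rng.gauss(100.0, 10.0) for _ in range(N_SEEDS)]
    heights = [g + soil_effect for g in genotypes]
    return genotypes, heights


good_genes, good_heights = grow(GOOD_SOIL)
poor_genes, poor_heights = grow(POOR_SOIL)

# Within each soil, height differences track genetic differences completely.
r_good = pearson_r(good_genes, good_heights)
r_poor = pearson_r(poor_genes, poor_heights)

# Between soils, the height gap is far larger than the chance gap in genes.
gene_gap = statistics.fmean(good_genes) - statistics.fmean(poor_genes)
height_gap = statistics.fmean(good_heights) - statistics.fmean(poor_heights)

print(f"within-soil r: good = {r_good:.3f}, poor = {r_poor:.3f}")
print(f"mean genetic gap = {gene_gap:.2f}, mean height gap = {height_gap:.2f}")
```

In this toy model the within-soil correlations are essentially perfect, yet the gap between the two samples equals the soil effect plus a small sampling difference in genetic values. High heritability within each group therefore says nothing about the source of the difference between groups.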
Another criticism of the genetic hypothesis is that careful analysis of environmental factors provides a sufficient explanation of race differences in IQ, that is, the genetic hypothesis is simply unnecessary. This is the approach taken by Brooks-Gunn, Klebanov, and Duncan (1996 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib225) ) in a study of 483 African American and White low birth weight children. What makes their study different from other similar analyses is the richness of their data. Instead of using only one or two measures of the environment (e.g., a single index of poverty level), they collected longitudinal data on income level and many other cofactors of poverty such as length of hospital stay, maternal verbal ability, home learning environment, neighborhood condition, and other components of family social class. When the children’s IQs were tested at age 5 with the WPPSI, the researchers found the usual disparity between the White children (mean IQ of 103) and the African American children (mean IQ of 85). However, when poverty and its cofactors were statistically controlled, the IQ differences were almost completely eliminated. Their study suggests that previous research has underestimated the pervasive effects of poverty and its cofactors as a contribution to African American and White IQ differences.
A third criticism of the genetic hypothesis is that race as a biological entity is simply nonexistent, that is, there are no biological races. Fish (2002 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib505) ) and other proponents of this viewpoint argue that “race” is a socially constructed concept, not a biological reality:
Homo sapiens has no extant subspecies: There are no biological races. Human physical appearance varies gradually around the planet, with the most geographically distant peoples generally appearing the most different from one another. The concept of human biological races is a construction socially and historically localized to 17th and 18th-century European thought. Over time, different cultures have developed different sets (folk taxonomies) of socially defined “races.” (p. 29)
Put another way, racial categories are social constructions based on superficial physical differences (especially skin color) that serve cultural-psychological objectives (e.g., reducing uncertainty about how
we should respond to one another). However, racial categories do not signify meaningful biological differences. A biologist expresses the point this way: “All of humanity shares in common the vast majority of its molecular genetic variation and the adaptive traits that define us as a single species” (Templeton, 2002 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1626) , p. 51). Thus, insofar as race has no biological reality, the argument that “race” differences in IQ originate from a genetic basis is not only pernicious, it is also absurd. Neisser, Boodoo, Bouchard, and others (1996 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1223) ) offer additional perspectives on race differences in IQ and related topics.
Before leaving the topic of race differences in IQ, we should point out that the emotion attached to this topic is largely undeserved, for two reasons. First, racial groups always show large overlaps in IQ, meaning that the peoples of the earth are much more alike than they are different. Second, as previously noted, the existing race differences in IQ certainly reflect cultural differences and environmental factors to a substantial degree. Wilson (1994 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1765) ) has catalogued the numerous differences in cultural background between African Americans and White Americans. In 1992, for example, 64 percent of African American parents were divorced, separated, widowed, or never married; 63 percent of African American births were to unmarried mothers; and 30 percent of African American births were to adolescents (U.S. Bureau of the Census, 1993). On average, these realities of family life for many African Americans inevitably will lead to lowered performance on intelligence tests. Lest the reader conclude that we are hereby endorsing a subtle form of Anglocentric superiority, consider Lynn’s (1987 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1020) ) conclusion that the mean IQ of the Japanese is 107, a full 7 points higher than the average for American Whites. So what?
Recent Trends in Race Differences in IQ
An important question is whether Black–White IQ differences have remained stable over recent decades (which could support a genetic basis for the IQ disparity) or whether the gap has narrowed in response to environmental progress (which could indicate a substantial ecological source for the IQ disparity). The former conclusion (stability of the IQ difference) has been stated by Jensen and others who hypothesize, in part, a genetic basis for the discrepancy (Jensen, 1980 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib832) ; Rushton & Jensen, 2005 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1418) ).
In contrast, a recent analysis by Dickens and Flynn (2006 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib413) ) supports a significant narrowing of the racial IQ gap. These researchers considered comparative longitudinal data for Black and White examinees for the period 1970 to 2000 with successive editions of four carefully standardized instruments: the Stanford-Binet, the Wechsler Intelligence Scale for Children, the Wechsler Adult Intelligence Scale, and the Armed Forces Qualifying Test. Their findings are complex and statistically laden, but here is the big picture: on all four instruments, Blacks gained in IQ compared to Whites during 1970 to 2000, the average gain amounting to 4 to 7 IQ points. The authors conclude:
The constancy of the Black-White IQ gap is a myth and therefore cannot be cited as evidence that the racial IQ gap is genetic in origin. (p. 917)
Overall, the average IQ for Black schoolchildren was estimated to be 90.5 in 2002, indicating that Black children have made large IQ gains relative to Whites since the 1960s. Dickens and Flynn (2006 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib413) ) conclude that
further Black economic progress would produce additional gains in IQ. This conclusion provides an optimistic outlook on a contentious social issue.
6.11 AGE CHANGES IN INTELLIGENCE
We turn now to another controversial topic: whether intelligence declines with age. Certainly, one of the most pervasive stereotypes about aging is that we lose intellectual ability as we grow older. This stereotype is so pervasive that few laypersons question it. But we should question it.
In general, the empirical study of this topic provides a more optimistic conclusion than the common stereotype suggests. However, the research also reveals that age changes in intelligence are complex and multifaceted. The simple question, “Does intelligence decline with age?” turns out to have several labyrinthine answers.
We can trace the evolution of research on age-related intellectual changes as follows:
Early cross-sectional research with instruments such as the WAIS painted a somber picture of a slow decline in general intelligence after age 15 or 20 and a precipitously accelerated descent after age 60.
Just a few years later, more sophisticated studies using sequential testing with multidimensional instruments such as the Primary Mental Abilities Test suggested a more optimistic trajectory for intelligence: minimal change in most abilities until at least age 60.
Parallel research utilizing the fluid/crystallized distinction posited a gradual increase in crystallized intelligence virtually to the end of life, juxtaposed against a rapid decline in fluid intelligence.
Most recently, a few psychologists have proposed that adult intelligence is qualitatively different, akin to a new Piagetian stage that might be called postformal reasoning. This research calls into question the ecological validity of using standard instruments with older examinees.
We examine each of these research epochs in more detail in the following sections.
Early Cross-Sectional Research
One of the earliest comprehensive studies of age trends on an individually administered intelligence test was reported by Wechsler (1944 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1727) ) shortly after publication of the Wechsler-Bellevue Form I. As is true of all the Wechsler tests designed for adults, raw scores on the W-B I subtests were first transformed into standard scores (referred to as scaled scores) with a mean of 10 and standard deviation of 3. Regardless of the age of the subject, these scaled scores were based on a fixed reference group of 350 subjects ages 20 to 34 included in the standardization sample. By consulting the appropriate age table, the sum of the 11 scaled scores was then used to find an examinee’s IQ.
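The key feature of this scoring scheme is that every examinee, whatever his or her age, is scaled against the same reference group. The sketch below illustrates that idea with a simple linear (z-score) transformation to a mean of 10 and standard deviation of 3. This is an assumption made for illustration: the published Wechsler scales rely on lookup tables rather than this formula, and the reference raw scores here are invented.

```python
import statistics


def make_scaler(reference_raw_scores, scaled_mean=10.0, scaled_sd=3.0):
    """Build a raw-to-scaled converter anchored to one fixed reference group."""
    ref_mean = statistics.fmean(reference_raw_scores)
    ref_sd = statistics.pstdev(reference_raw_scores)

    def to_scaled(raw_score):
        z = (raw_score - ref_mean) / ref_sd
        return round(scaled_mean + scaled_sd * z)

    return to_scaled


# Hypothetical raw scores on one subtest for the reference group (ages 20 to 34).
reference_group = [20, 24, 26, 28, 30, 32, 34, 36, 40]
to_scaled = make_scaler(reference_group)

# The same converter is applied to every examinee, regardless of age.
for raw in (24, 30, 36):
    print(f"raw {raw} -> scaled {to_scaled(raw)}")
```

Because the converter never changes with the examinee's age, the sum of scaled scores expresses standing relative to young adults, which is what allows it to be plotted against age.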
However, the sum of the scaled scores by itself is a direct index of an examinee’s ability relative to the reference group. Wechsler used this index to chart the relationship between age and intelligence. His results indicated a rapid growth in general intelligence in childhood through age 15 or 20, followed by a slow decline to age 65. He was characteristically blunt in discussing his findings:
If the fact that intellectual growth stops at about the age of fifteen has been a hard fact to accept, the indication that intelligence after attaining its maximum forthwith begins to decline just as any other physiological capacity, instead of maintaining itself at its highest level over a long period of time, has been an even more bitter pill to swallow. It has, in fact, proved so unpalatable that psychologists have
generally chosen to avoid noticing it. (Wechsler, 1952 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1729) )
Normative studies with subsequent Wechsler adult tests revealed exactly the same pattern. For example, results for the WAIS-IV have been computed in Figure 6.12 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/ch06lev1sec11#ch06fig12) , which shows the average uncorrected subtest scores for all age groups in the normative sample, relative to results for the highest scoring age group (25- to 29-year-olds).
FIGURE 6.12 The Curve of Supposed Age-Related Decline in Average WAIS-IV Subtest Scores
Source: Based on data in Wechsler, D. (2008 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1737) ). Manual for the Wechsler adult intelligence scale—fourth edition. San Antonio, TX: Pearson.
Overlooked by Wechsler and many other cross-sectional design (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm01#bm01gloss83) researchers was the influence of their methodology on their findings. It has been recognized for quite some time that cross-sectional studies often confound age effects with educational disparities or other age-group differences (see Baltes, Reese, & Nesselroade, 1977 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib81) ; Kausler, 1991 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib871) ). For example, in the normative studies of the Wechsler tests, it is invariably true that the younger standardization subjects are better educated than the older ones. In all likelihood, the lower scores of the older subjects are caused, in part, by these educational differences rather than signifying an inexorable age-related decline.
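A toy model shows how this confound can manufacture an apparent decline. In the sketch below, test scores depend only on years of schooling, and the true effect of age is set to zero by construction. Every number (schooling by age group, points per school year, baseline) is hypothetical and chosen only to make the pattern visible.

```python
# Hypothetical mean years of schooling for four age groups tested in the same year.
MEAN_SCHOOLING_BY_AGE = {25: 14.0, 45: 12.5, 65: 11.0, 85: 9.5}

BASELINE = 70.0                # hypothetical score with zero schooling
POINTS_PER_SCHOOL_YEAR = 2.0   # hypothetical effect of schooling on the score
TRUE_AGE_EFFECT = 0.0          # aging itself changes nothing in this model


def expected_score(age, schooling_years):
    """Expected test score under the toy model."""
    return BASELINE + POINTS_PER_SCHOOL_YEAR * schooling_years + TRUE_AGE_EFFECT * age


# Cross-sectional comparison: each age group brings its own level of schooling.
cross_sectional = {
    age: expected_score(age, years) for age, years in MEAN_SCHOOLING_BY_AGE.items()
}

# Same comparison with schooling equalized at 12 years for every age group.
equal_schooling = {age: expected_score(age, 12.0) for age in MEAN_SCHOOLING_BY_AGE}

print("cross-sectional means:", cross_sectional)
print("schooling held equal: ", equal_schooling)
```

The cross-sectional means fall steadily from the youngest to the oldest group even though no one in the model loses ability with age; once schooling is equalized, the age curve is flat.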
Sequential Studies of Intelligence
To control for age-group differences, many researchers prefer a longitudinal design (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm01#bm01gloss191) in which the same subjects are retested one or more times over periods of 5 to 10 years and, in rare cases, up to 40 years later. Because there is only one group of subjects, longitudinal designs eliminate age-group disparities (e.g., more education in the young than the old subjects) as a confounding factor. However, the longitudinal approach is not without its shortcomings. Longitudinal studies are prone to practice effects, which is the finding that participants learn the answers when they take the same test on several occasions; selective attrition, which is the observation that the least healthy participants are the most likely to drop out; and history, which is the discovery that major historical events (e.g., the Great Depression) can distort the intellectual and psychological development of entire generations.
The most efficient research method for studying age changes in ability is a cross-sequential design (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm01#bm01gloss84) that combines cross-sectional and longitudinal methodologies (Schaie, 1977 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1449) ):
In brief, the researchers begin with a cross-sectional study. Then, after a period of years, they retest these subjects, which provides longitudinal data on several cohorts—a longitudinal sequence. At the same time, they test a new group of subjects, forming a second cross-sectional study—and, together with the first cross-sectional study, a cross-sectional sequence. This whole process can be repeated over and over (every five or ten years, say) with retesting of old subjects (adding to the longitudinal data) and first-testing of new subjects (adding to the cross-sectional data). (Schaie & Willis, 1986 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1456) )
In 1956, Schaie began the most comprehensive cross-sequential study ever conducted in what is referred to as the Seattle Longitudinal Study (Schaie, 1958 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1448) , 1996 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1453) , 2005 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1454) ). He administered Thurstone’s test of five primary mental abilities (PMAs) and other intelligence-related measures to an initial cross-sectional sample of 500 community-dwelling adults. The PMA Test subtests include Verbal Meaning, Space, Reasoning, Number, and Word Fluency. In 1963, he retested these subjects and added a new cross-sectional cohort. Additional waves of data were collected in 1970, 1977, 1984, 1991, and 1998.
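The layout of a cross-sequential design can be sketched as a testing schedule. The wave years below come from the text; the rest is a simplifying assumption for illustration: every wave is treated as adding one new sample and retesting all earlier samples, and attrition is ignored. The function name is hypothetical.

```python
WAVES = [1956, 1963, 1970, 1977, 1984, 1991, 1998]  # wave years given in the text


def cross_sequential_schedule(waves):
    """Map each wave to the samples tested in it.

    Each sample is a tuple (year first tested, years since first test).
    Assumes one new sample per wave and no attrition.
    """
    schedule = {}
    for wave in waves:
        schedule[wave] = [(entry, wave - entry) for entry in waves if entry <= wave]
    return schedule


schedule = cross_sequential_schedule(WAVES)

for wave, samples in schedule.items():
    print(wave, samples)

# Longest longitudinal follow-up available under these assumptions.
longest_follow_up = max(years for samples in schedule.values() for _, years in samples)
print("longest follow-up in years:", longest_follow_up)
```

Reading across one row gives a cross-sectional comparison among samples tested in the same year, while following one entry year down the rows gives that sample's longitudinal record.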
Three conclusions emerged from Schaie’s cross-sequential study of adult mental abilities:
Each cross-sectional study indicated some degree of apparent age-related decrement in mental abilities, postponed until after age 50 for some abilities, but beginning after age 35 for others. In particular, Number skills and Word Fluency showed an age-related decrement only after age 50, whereas Verbal Meaning, Space, and Reasoning scores appeared to decline sooner, after age 35.
Successive cross-sectional studies (the cross-sectional sequence) revealed significant intergenerational differences in favor of those born most recently. Even holding age constant, those born and tested most recently performed better than those born and tested at an earlier time. For example, 30-year-old examinees tested in 1977 tended to score better than 30-year-old examinees tested in 1970, who tended to score better than 30-year-old examinees tested in 1963, who, in turn, outperformed 30-year-old examinees tested in 1956. However, these cohort differences in intelligence were not uniform across the different abilities measured by the PMA Test. The pattern of rising abilities was most apparent for Verbal Meaning, Reasoning, and Space. Cohort changes for Number and Word Fluency were more complex and contradictory.
In contrast to the moderately pessimistic findings of the cross-sectional comparisons, the longitudinal comparisons showed a tendency for mean scores either to rise slightly or to remain constant until approximately age 60 or 70. The only exceptions to this trend involved highly speeded tests such as Word Fluency, in which the examinee must name words in a given category as quickly as possible, and Number, in which the examinee must complete arithmetic computations quickly and accurately.
The results of the Schaie study are even more optimistic when individual longitudinal findings are disentangled from the group averages. As previously noted, the longitudinal findings differed from one mental ability to another. Nonetheless, taking the average of the five PMAs and using the 25th percentile for 25-year-olds as his standard of meaningful decline, Schaie has shown that no more than 25 percent of those studied had declined by age 67. From age 67 to age 74 about a third of the subjects had declined, whereas from age 74 to age 81, slightly more than 40 percent had declined (Schaie, 1980 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1451) , 1996 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1453) ; Schaie & Willis, 1986 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1456) ). In sum, the vast majority of us show no meaningful decline in the skills measured by the Primary Mental Abilities Test until we are well into our seventies. Perhaps even more impressive is the fact that approximately 10 percent of the sample improved significantly when retested in their seventies and eighties. Based on his research and other longitudinal studies, Schaie arrives at this conclusion:
If you keep your health and engage your mind with the problems and activities of the world around you, chances are good that you will experience little if any decline in intellectual performance in your lifetime. That’s the promise of research in the area of adult intelligence. (Schaie & Willis, 1986 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1456) )
A recent study by Gow, Johnson, Pattie, and others (2011 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib635) ) provides additional insight into the fate of intelligence in old age. They obtained follow-up test data from elderly persons at ages 70, 79, and 87, using the same instrument first administered to participants at age 11. One cohort, born in 1921, was retested at age 79 and again at age 87. A second cohort, born in 1936, was retested at age 70. Sample sizes were very large, in the hundreds at each testing. The same test, the Moray House Test, No. 12 (MHT), was used throughout. The MHT consists of 71 items involving diverse domains of general intelligence, including following directions, same-opposites, word classification, analogies, practical items, and reasoning. Although little recognized in the United States, the MHT is a respected instrument used in Scotland and elsewhere for tracking epidemiological changes in intelligence. MHT total scores correlate about .80 with Stanford-Binet IQ scores (Gow et al., 2011 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib635) ). The test does not provide an IQ. Results are given as a total raw score, with a maximum possible score of 76. Participants also took the Mini-Mental State Exam (MMSE) when tested in old age. As noted, this measure is a simple 30-item test of orientation, memory, and other cognitive skills. The MMSE is used for dementia screening, and normal adults typically score in the range of 27 to 30 points.
Bearing in mind that the data come from separate cohorts born in 1921 and 1936, the results point to a gradual decline in general intelligence, as measured by the MHT, after age 70. Specifically, average scores at ages 70, 79, and 87 were 64.2, 59.2, and 54.1, respectively. In contrast, orientation, memory, and everyday cognitive skills declined little, about half a point (on the 30-item MMSE), on average, every decade or so. The scores for both the MHT and the MMSE revealed greater variability with advancing age, a common finding in research on aging.
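To make the size of the MHT decline concrete, the reported means can be converted into an average change per year. The following is a minimal sketch using only the three means given above; the function name `change_per_year` is hypothetical, and because the age-70 mean comes from the 1936 cohort while the other two come from the 1921 cohort, the first interval is a cross-cohort comparison rather than a true longitudinal change.

```python
# Mean MHT raw scores by age, as reported in the text (Gow et al., 2011).
# Age 70 is from the 1936 cohort; ages 79 and 87 are from the 1921 cohort.
MHT_MEANS = {70: 64.2, 79: 59.2, 87: 54.1}


def change_per_year(means):
    """Average raw-score change per year between adjacent tested ages."""
    ages = sorted(means)
    return {
        (younger, older): (means[older] - means[younger]) / (older - younger)
        for younger, older in zip(ages, ages[1:])
    }


rates = change_per_year(MHT_MEANS)
for (younger, older), rate in rates.items():
    print(f"ages {younger}-{older}: {rate:+.2f} raw points per year")
```

Both intervals work out to a loss of somewhat more than half a raw point per year on a test with a maximum score of 76.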
Gow et al. (2011 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib635) ) also sought to determine whether high intelligence in youth buffers against cognitive decline in old age. This was the special virtue of possessing test scores for all participants at age 11, which allowed researchers to map the trajectories of cognitive capacity as a function of initial ability. In the 1921 cohort tested at ages 79 and 87, they found that higher intelligence at age 11 did not slow the decline experienced in later life. Participants with initially higher MHT scores showed just as much cognitive decline as those with initially lower scores, but still maintained their relative advantage when tested in old age.
Age and the Fluid/Crystallized Distinction
Although we concur with the conclusions of Schaie and Willis (1986 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1456) ), it would be unfair to leave the impression that all authorities in this area agree. Horn and Cattell have been the most vocal skeptics, arguing for a significant age-related decrement in fluid intelligence because of its reliance upon neural integrity, which is presumed to decline with advancing age (Horn & Cattell, 1966 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib787) ; Horn, 1985 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib785) ). Cross-sectional studies certainly support this view. For example, Wang and Kaufman (1993 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1713) ) plotted age differences in vocabulary and matrices scores from the Kaufman Brief Intelligence Test and found little change in vocabulary (crystallized measure) but a sharp drop in matrices (fluid measure). These results held true even when the scores were adjusted for educational level. Of course, cross-sectional studies are open to rival interpretations and can, therefore, only suggest longitudinal patterns. Readers who wish to pursue this controversy should consult Hofer, Sliwinski, and Flaherty (2002 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib751) ) and Lindenberger and Baltes (1994 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib993) ).
More recently, Schaie, Caskie, Revell, and others (2005 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1454) ) demonstrated the same age-related patterns (negligible changes in crystallized measures, large decrements in fluid measures) in a follow-up study of older participants from the Seattle Longitudinal Study. Their participants comprised three groups: early-old (ages 60–69, N = 180), middle-old (ages 70–79, N = 205), and old-old (ages 80–95, N = 114). On average, the three groups were 64.2, 74.6, and 84.3 years of age, respectively. These individuals were administered a battery of 37 cognitive and neuropsychological measures assembled from well-known instruments, including the Wechsler Adult Intelligence Scale-Revised (WAIS-R, Wechsler, 1981 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1732) ), the Primary Mental Abilities test (PMA, Schaie, 1985 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1452) ), and several other tests. In Figure 6.13 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/ch06lev1sec11#ch06fig13) , we have depicted the results from four key subtests. Two of these subtests depend heavily on fluid cognitive factors (Reasoning and Spatial Thinking from the PMA), and two require significant crystallized abilities (Vocabulary and Comprehension from the WAIS-R). Scores are depicted as a percentage of the early-old group (ages 60–69), which typically earned the highest average score on all subtests. The reader will notice that raw scores on Comprehension and Vocabulary (crystallized abilities) reveal a nearly flat trend for the three age groups, whereas raw scores on Reasoning and Spatial Thinking (fluid abilities) disclose a steep decline for individuals in their 70s, 80s, and beyond.
FIGURE 6.13 Cross-Sectional Comparison of Age Trends for Four Cognitive Subtests
Source: Based on data from Schaie, K. W., Caskie, G., Revell, A., & others (2005 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1454) ). Extending neuropsychological assessments in the Primary Mental Ability space. Aging, Neuropsychology, and Cognition, 12, 245–277.
6.12 GENERATIONAL CHANGES IN IQ SCORES
What happens to the intelligence of a population from one generation to the next? For example, how does the intelligence of Americans in the year 2010 compare to the intelligence of their forebears in the early 1900s? We might expect that any differences would be small. After all, the human gene pool has remained essentially constant for centuries, perhaps millennia. Furthermore, only a small fraction of any generation is exposed to the extremes of environmental deprivation or enrichment that might stunt or boost intelligence dramatically. Common sense dictates that any generational changes in population intelligence would be minimal.
On this issue, common sense appears to be incorrect. Flynn (1984 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib514) , 1987 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib515) ) charted the comparison data from successive editions of the Stanford-Binet and the Wechsler tests from 1932 to 1981 and found that, with only one exception, each edition established a higher standard than its predecessor. For example, when the WISC-R was released in the 1970s, a large sample of five- and six-year-old children was tested on both this instrument and the earlier WPPSI, released in the 1960s. The testing was counterbalanced, half of the sample taking the WPPSI first, half taking the WISC-R first. The average WPPSI IQ for these 140 children was 112.8, whereas the same children earned an average WISC-R IQ of about 108.6. Because each new test is calibrated to a general population average of 100, this difference indicates an apparent 4-point gain in the population from the time the WPPSI was standardized (in 1965) to the time the WISC-R was standardized (in 1972). When new revisions are charted against their predecessors in the manner described here, the total apparent gain in mean IQ amounts to about 14 points in the five decades from 1932 to 1981 (Flynn, 1984 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib514) ).
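The arithmetic behind these comparisons can be laid out explicitly. The sketch below uses only the figures quoted above; the function names `apparent_gain` and `gain_per_year` are hypothetical, and treating the gain as linear over time is a simplifying assumption rather than a claim made by Flynn.

```python
def apparent_gain(older_test_mean, newer_test_mean):
    """Apparent population gain in IQ points between two standardizations.

    The same children score higher on the older test because its norms
    reflect a lower-scoring reference population.
    """
    return older_test_mean - newer_test_mean


def gain_per_year(gain, older_norm_year, newer_norm_year):
    """Average gain per year, assuming (for illustration) a linear trend."""
    return gain / (newer_norm_year - older_norm_year)


# WPPSI (standardized 1965) versus WISC-R (standardized 1972), same 140 children.
wppsi_vs_wiscr = apparent_gain(112.8, 108.6)
print(round(wppsi_vs_wiscr, 1))
print(round(gain_per_year(wppsi_vs_wiscr, 1965, 1972), 2))

# Cumulative apparent gain of about 14 points from 1932 to 1981.
print(round(gain_per_year(14, 1932, 1981), 2))
```

The single WPPSI and WISC-R comparison yields about 4.2 points over seven years, whereas the cumulative figure of 14 points over 49 years averages closer to 0.3 points per year, so any one pair of tests should not be read as the long-run rate.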
This apparent rise in IQ over generations is known as the Flynn effect in honor of the psychologist who first delineated the occurrence (Flynn, 2007a (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib517) ). Although the Flynn effect may have slowed in recent decades in some countries, it is still found in nearly every comparison of average IQs for successive editions of mainstream intelligence tests. This trend of rising performance has been observed in many nations using other tests as well, including Raven’s Progressive Matrices and the Peabody Picture Vocabulary Test (Daley, Whaley, Sigman, Espinosa, & Neuman, 2003 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib386) ; Nettelbeck & Wilson, 2004 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1227) ).
However, IQ gains of the magnitude observed pose a serious problem of causal explanation. Flynn (1994 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib516) ) is skeptical that any real and meaningful intelligence of a population could vault upward so quickly. He concludes that current tests do not measure intelligence but rather a correlate with a weak causal link to intelligence:
Psychologists should stop saying that IQ tests measure intelligence. They should say that IQ tests measure abstract problem-solving ability (APSA), a term that accurately conveys our ignorance. We know people solve problems on IQ tests; we suspect those problems are so detached, or so abstracted from reality, that the ability to solve them can diverge over time from the real-world problem-solving ability called intelligence; thus far we know little else. (Flynn, 1987 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib515) )
Other explanations for the Flynn effect include better nutrition, improved prenatal care, greater educational access, and increased environmental complexity (Lynn, 2009 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1021) ; Sundet, Borren, & Tambs, 2008 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1600) ). On this last point, environmental complexity, Flynn (2007b (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib518) ) provides a telling illustration by way of generational changes in TV programs. He notes that early 1960s shows like I Love Lucy or Dragnet required almost no concentration to follow, whereas in the 1980s dramas like Hill Street Blues introduced up to 10 threads in the story line. More recently, the hit action-thriller drama 24 portrays as many as 20 characters and multiple plot lines.
In a recent interview, Flynn has suggested that ways of thinking and solving problems have undergone dramatic worldwide shifts in the last century.
Today we take it for granted that using logic on the abstract is an ability we want to cultivate and we are interested in the hypothetical. People from 1900 were not scientifically oriented but utilitarian and they used logic, but to use it on the hypothetical or on abstractions was foreign to them. Alexander Luria [a Soviet psychologist] went to talk to headmen in villages in rural Russia and he said to them: “Where there is always snow, bears are white. At the North Pole there is always snow, what colour are the bears there?” And they said: “I’ve only seen brown bears.” And he said: “What do my words convey?” And they said: “Such a thing is not to be settled by words but by testimony.” They didn’t settle questions of fact by logic, they settled them by experience (Witchalls, 2012 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1775) , p. 1).
Regardless of the causes, the Flynn effect has sensitized psychologists to the dangers of rendering conclusions based on ever-shifting intelligence test norms. Changes in IQ over time make it imperative to restandardize tests frequently; otherwise, examinees are scored with obsolete norms and receive inaccurate IQ scores. This is especially a problem when IQ scores are used for important decisions such as eligibility for learning disability programs or entitlement to social security benefits. At the other extreme, issues literally of life and death can be at stake when IQ scores impact capital punishment decisions via the diagnosis of mental retardation (Kanaya, Scullin, & Ceci, 2003 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib855) ).
Several recent studies indicate that the Flynn effect may have abated or even reversed in the beginning of the twenty-first century, at least in some countries. Reviewing data from more than a half-million Danish men over the period 1959 to 2004, Teasdale and Owen (2005 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1621) ) found that average performance on a military entry intelligence test gained slowly, peaked in the late 1990s, and has since declined slowly. Sundet, Barlaug, and Torjussen (2004 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1599) ) found a similar pattern with Norwegian conscripts on a test of matrix reasoning, with improved performance from the 1950s until the 1990s, followed by a reversal and decline. Using Piagetian tests of conservation of weight, volume, and quantity with seventh-grade British schoolchildren, Shayer, Ginsburg, and Coe (2007 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1488) ) documented a steady decline in performance from 1975 to 2003, a phenomenon they dubbed the “anti-Flynn effect.”
Yet, in many countries the Flynn effect continues unabated. Flynn and Rossi-Casé (2012 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib519) ) found large gains on Raven’s Progressive Matrices in Argentina between 1964 and 1998. In South Korea, te Nijenhuis, Cho, Murphy, and Lee (2012 (http://content.thuzelearning.com/books/Gregory.8055.17.1/sections/bm02#bm02bib1617) ) reported large IQ gains as well. The Flynn effect continues to be a puzzling and complex phenomenon.