Case studies Assessment


CHAPTER 7 Testing Special Populations

TOPIC 7A Infant and Preschool Assessment
7.1 Assessment of Infant Capacities
7.2 Assessment of Preschool Intelligence
7.3 Practical Utility of Infant and Preschool Assessment
7.4 Screening for School Readiness
7.5 DIAL-4

The individual and group tests reviewed in previous chapters are suitable for persons with normal or near-normal capacities in speech, hearing, vision, movement, and general intellectual ability. However, not every examinee falls within the ordinary spectrum of physical and mental abilities. By reason of immature age, physical disability, language weakness, or diminished intellect, a large proportion of the population falls outside the reach of traditional tests and procedures.

Infants and very young children certainly require exceptional approaches to assessment because of their limited capacities for communication. In Topic 7A, Infant and Preschool Assessment, we review the nature and application of infant and early childhood assessment devices and then investigate a fundamental question pertaining to these tests: What is the practical utility of testing children early in life? In particular, is there any predictive validity for test results obtained from infants or toddlers? If instruments for very young examinees do not predict important outcomes later in life, then using them would appear to be pointless and perhaps even misleading. We examine this quandary in some detail. Finally, we conclude the topic with a discussion of an important application of preschool testing: screening for school readiness.

In Topic 7B, Testing Persons with Disabilities, we scrutinize a variety of tests needed for the assessment of individuals with special needs. These special needs cover a wide spectrum, including language, hearing, and visual impairments. Of course, persons with developmental disabilities also require special approaches to assessment, and we provide coverage of this field as well. By one estimate, as many as 7.5 million U.S. citizens manifest intellectual disabilities, and 1 in 10 families are directly affected by this functional impairment (Grossman, Richards, Anglin, & Hutson, 2000).

7.1 ASSESSMENT OF INFANT CAPACITIES

The infant and preschool period extends from birth to roughly 6 years of age. The changes that occur during this period obviously are profound. The infant develops basic reflexes, masters developmental milestones (grasping, crawling, sitting, standing, and so forth), learns a language, and establishes the capacity for symbolic thought. For most children, the pattern and pace of development is visibly within normal limits.

However, parents and professionals trained in the assessment of infants and preschoolers occasionally encounter children whose development seems to be slow, delayed, or even overtly impaired. These children elicit a flurry of anxious questions: How delayed is this child? What are the prospects for normal functioning in school? Will this child achieve personal independence in the adult years? Another area of concern for many parents is the emotional development of their infants and children. Even normal children display trials and challenges that would test the saints. Visit any busy shopping mall and you will encounter scenes of hysterical, screaming children with frazzled parents attempting to cope. Listen to any honest parent with a toddler and you will hear a story or two of food smeared on walls, puppies tormented, obstinate refusal to stay in bed, or similar unpleasant actions. At what point do difficult and problematic behaviors portend a life of emotional troubles, when not promptly treated?

At the opposite extreme are those precocious children who achieve developmental milestones months or years ahead of the normative schedule. In these cases, the proud parents have a different set of concerns: How advanced is my child? What are the strongest and weakest areas of intellectual functioning? Will this child be a gifted adult?

Infant and preschool assessment tools can help answer questions about the intellectual and emotional development of children, whether they are developmentally delayed, intellectually gifted, at-risk for emotional disorder, or within the normal spectrum. In this topic, we review the nature and application of representative infant and preschool measures. These tools include individual tests, developmental schedules, and rating scales. We begin with a description of several prominent instruments and then investigate the fundamental question of purpose or utility. What is the use of these measures? What is the meaning of a score on a developmental schedule or preschool intelligence test? To what extent do these procedures allow us to prognosticate adult functioning or, for that matter, help us to predict early school performance?

These questions will be more meaningful if we first review the relevant instruments. We divide the review into two parts: infant measures for children from birth to age 2½, and preschool tests for children from age 2½ to age 6. The division is somewhat arbitrary, but not entirely so. Infant tests tend to be multidimensional and to load significantly on sensory and motor development. Beginning at age 2½, standardized measures such as the Stanford-Binet: Fifth Edition, Kaufman Assessment Battery for Children-2, and Differential Ability Scales-II are typically used in the assessment of preschool children. These tests load heavily on cognitive skills such as verbal comprehension and spatial thinking. Thus, infant scales and preschool tests measure somewhat different components of intellectual ability.

Neonatal Behavioral Assessment Scale (NBAS)

The Neonatal Behavioral Assessment Scale (NBAS) is unique because of its theoretical basis, which emphasizes the need to document the contributions of the newborn to the parent–infant system. The pediatrician T. Berry Brazelton (Brazelton & Nugent, 1995) developed this instrument to identify and understand the “deviant” infant and to explore the baby’s reciprocal impact on parents:

My goal in developing the NBAS was to assess the baby’s contributions to the failures that resulted, when parents were presented with a difficult or deviant infant. If we could understand the reasons behind the infant’s deviant behavior, perhaps we could in turn lead parents to a better understanding of their role. This then could lead to a more optimal outcome. (Brazelton & Nugent, 1995)

The NBAS is suitable for infants up to two months of age but is most commonly administered in the first week of life. The scale assesses the infant’s behavioral repertoire on 28 behavior items, each scored on a 9-point scale. Examples of the behavior items include the following:

• Response decrement to light
• Orientation to inanimate visual stimulus
• Cuddliness
• Consolability

In addition, the infant’s neurological status is evaluated on 18 reflex items, each scored on a 4-point scale. Examples include the following:

• Plantar grasp
• Babinski reflex
• Rooting reflex
• Sucking reflex

Finally, seven supplementary items can be used to summarize the qualities of responsiveness of frail, high-risk infants, including these:

• Quality of alertness
• General irritability
• Examiner’s emotional response to infant

Brazelton and Nugent (1995) do not provide an integrative scoring system; that is, there are no summary scores for the entire battery or its subcomponents. Instead, the “scoring” of the NBAS consists of a summary sheet with ratings on each specific item. In clinical work, the instrument is used to provide feedback to parents. Specifically, Brazelton recommends that health care professionals demonstrate the NBAS in order to sensitize parents to their baby’s uniqueness and to promote a positive parent–infant relationship. Hawthorne (2009) describes the clinical application of the instrument for promoting successful caregiving strategies. Regarding clinical use of the test, Fowles (1999) compared mothers who received a demonstration of the NBAS with a matched control group and showed that the intervention group subsequently rated their infants as significantly more predictable. Thus, the NBAS was found to be useful in helping mothers anticipate their infants’ responses to environmental stimuli. However, based on a comprehensive review of published studies, Britt and Myers (1994) provide a less optimistic review of the effects of the NBAS intervention, noting inconsistent findings in areas such as parent–infant interaction, infant development, temperament, and parental attitudes and satisfaction.

For research on newborn outcomes, various investigators have developed scoring systems for the NBAS, including a popular seven-cluster scoring method proposed by Lester (1984). This method provides summary scores for identified clusters (habituation, orientation, motor performance, arousal/lability, regulation, autonomic stability, and reflexes). Using a quantitative scoring approach, researchers have linked prenatal cocaine exposure to inferior performance on the NBAS (Morrow et al., 2001; Schuler, 1999). In addition, the NBAS is sensitive to the detrimental effects of polychlorinated biphenyls (PCBs) on babies born to women who consumed contaminated Lake Ontario fish (Stewart, Reihman, Lonky, Darvill, & Pagano, 1999). The NBAS also shows sensitivity to the impact of major depression in mothers by revealing greater arousal and less attentiveness to face/voice stimuli in their newborn babies (Hernandez-Reif, Field, Diego, & Ruddock, 2006). Further, the instrument is sensitive to changes in feeding behavior of premature infants (Medoff-Cooper & Ratcliffe, 2005). In general, these studies demonstrate the value of the NBAS in a wide variety of research endeavors with infants.

In spite of the proven utility of the NBAS as a clinical and research tool, reviewers have been somewhat skeptical about the psychometric properties of the instrument. For example, Majnemer and Mazer (1998) point to very low test–retest reliability coefficients (r = −0.15 to +0.32 for the individual items) and weak interrater agreement. One likely explanation is that in newborn infants, individual traits may fluctuate rapidly over short periods of time, which would produce an underestimate of true reliability when the NBAS is given twice over a period of days or weeks. For this reason, deviant scores from a single administration of the NBAS should not be overinterpreted.

Bayley-III

Originally released in 1969, the Bayley test is now in its third edition (Bayley, 2006). Suitable for children 1 month to 42 months of age, this instrument is an important mainstay for the evaluation of developmental delay in infants and toddlers. Known formally as the Bayley Scales of Infant and Toddler Development-III and informally as the Bayley-III, the most recent version represents a vast extension and revision of the earlier editions. For example, the first edition of the test evaluated only the cognitive and motor capacities of infants, whereas the latest edition provides for the assessment of five domains. The domains and representative capacities tested are listed here.

• Cognitive Scale: 91 items involving sensory acuity, perceptual skill, attention, object permanence, exploration and manipulation, puzzle solving, color matching, and counting. The Cognitive Scale does not contain separate subtests.

• Language Scale: 48 items involving receptive and expressive communication. Items involve recognition of sounds, nonverbal expression, following simple directions, identifying action pictures, naming objects, and answering questions. The Language Scale yields separate scores for Expressive Communication and Receptive Communication as well as a composite Language Scale score.

• Motor Scale: 138 items pertaining to gross motor and fine motor skills. Items involve object manipulation, functional hand skills, postural control, dynamic movement, and motor planning. The Motor Scale yields separate scores for Gross Motor and Fine Motor as well as a composite Motor Scale score.

• Social-Emotional: 35 items involving interactive and purposeful use of emotions, ability to convey feelings, and connection of ideas and emotions. The Social-Emotional Scale does not contain separate subtests.

• Adaptive Behavior Scale: Caregivers complete items on a 4-point scale of 0 (is not able), 1 (never when needed), 2 (sometimes when needed), or 3 (always when needed); items pertain to Communication, Community Use, Health and Safety, Leisure, Self-Care, Self-Direction, Functional Pre-Academics, Home Living, Social, and Motor. This scale yields separate scaled scores for each of the ten areas listed, as well as a General Adaptive Composite (GAC).

The five major clusters listed above each yield a composite score reported as a standard score (M = 100, SD = 15). Note that the Bayley-III does not yield an overall score akin to an IQ score on a traditional test. Such a score could be misleading in light of the broad range of diverse skills now assessed in the third edition of the test. Instead, the instrument seeks to yield a profile of scores useful in infant assessment and diagnosis. To this end, all scores on the instrument (including the many subscales listed above) can be reported as scaled scores (mean = 10, SD = 3) for purposes of intra-individual comparison. This yields a useful chart that helps pinpoint areas of needed intervention. For example, the child depicted in Table 7.1, a 37-month-old boy referred for assessment, appears to present with mild intellectual disability characterized by problems with expressive communication, fine motor skills, communication, functional pre-academics, and self-direction.

TABLE 7.1 Bayley-III Scaled Score Results for a 37-Month-Old Infant

Scale | Cog | Language | Language | Motor | Motor | SE
Subtest | Cog | RC | EC | FM | GM | SE
Scaled score | 6 | 7 | 4 | 3 | 8 | 4

Adaptive Behavior
Subtest | Com | CU | FA | HL | HS | LS | SC | SD | Soc | MO
Scaled score | 4 | 7 | 4 | 8 | 7 | 7 | 5 | 4 | 6 | 6

Cog = Cognitive, RC = Receptive Communication, EC = Expressive Communication, FM = Fine Motor, GM = Gross Motor, SE = Social-Emotional, Com = Communication, CU = Community Use, FA = Functional Pre-Academics, HL = Home Living, HS = Health and Safety, LS = Leisure, SC = Self-Care, SD = Self-Direction, Soc = Social, MO = Motor.

Note: An average score in the general population is 10, and scores between 8 and 12 typically are considered normal. Scores of 4 or below are areas of potential concern.
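As a minimal sketch of how the decision rules in the table note might be applied, the following Python snippet labels each scaled score from Table 7.1. The dictionary is a hand transcription of the table (reading the second Language column as EC is an assumption), and the function name and the labels for bands the note does not name are hypothetical. It is an illustration only, not part of the Bayley-III scoring procedure.

```python
# Scaled scores transcribed from Table 7.1 (population mean = 10, SD = 3).
# Assumption: the second Language column is EC (Expressive Communication).
bayley_scaled = {
    "Cog": 6, "RC": 7, "EC": 4, "FM": 3, "GM": 8, "SE": 4,
    "Com": 4, "CU": 7, "FA": 4, "HL": 8, "HS": 7, "LS": 7,
    "SC": 5, "SD": 4, "Soc": 6, "MO": 6,
}

CONCERN_CUTOFF = 4       # "scores of 4 or below" in the table note
NORMAL_RANGE = (8, 12)   # "scores between 8 and 12" in the table note


def classify(score: int) -> str:
    """Label one scaled score using the cutoffs stated in the table note."""
    if score <= CONCERN_CUTOFF:
        return "potential concern"
    if score < NORMAL_RANGE[0]:
        return "below normal range"   # hypothetical label for scores 5 to 7
    if score <= NORMAL_RANGE[1]:
        return "normal range"
    return "above normal range"       # hypothetical label for scores above 12


profile = {name: classify(score) for name, score in bayley_scaled.items()}
concerns = [name for name, label in profile.items()
            if label == "potential concern"]

if __name__ == "__main__":
    for name, label in profile.items():
        print(f"{name:>3}: {bayley_scaled[name]:>2}  {label}")
    print("Areas of potential concern:", ", ".join(concerns))
```

Applied to this profile, the cutoff of 4 or below flags EC, FM, SE, Com, FA, and SD, while GM and HL fall in the normal range.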

The technical quality and excellent standardization of the Bayley-III mark this test as the psychometric pinnacle of its field. The normative sample of 1,700 children was stratified according to age and essential demographic variables, and the test developers also collected extensive data on children with high-incidence clinical diagnoses such as autism and intellectual disability. Internal consistency reliability of the five composite scores appears to be strong, with average reliability coefficients as high as .93 (Language) and .91 (Cognitive). Test–retest reliability over a short period (average of 6 days) is predictably lower, with coefficients ranging from .67 (Fine Motor) to .80 (Expressive Communication). Average stability coefficient across all ages for the major composites was .80, which is decent given that infants and toddlers are notoriously distractible.

Validity evidence for the Bayley-III is scant at this time, but wholly supportive. For example, confirmatory factor analysis of the subtests of the Cognitive, Language, and Motor scales supported the three-factor model across all age groups of the standardization sample, except for the youngest age group (Bayley, 2006). Concurrent validity coefficients with other instruments are strong as well. For example, the WPPSI-III Full Scale IQ scores correlated .72 to .79 with Bayley-III Cognitive composites. Correlations of the Motor and Adaptive Behavior composites with suitable instruments also were appropriately strong, on the order of .50 to .70. We agree with reviewers who assert that the Bayley-III continues to set the standard for early childhood assessment, and will maintain its status as the most frequently used measure of infant and toddler development (Albers & Grieve, 2007).

Devereux Early Childhood Assessment-Clinical Form (DECA-C)

The Devereux Early Childhood Assessment-Clinical Form (DECA-C) is a refreshing addition to the assessment field. The scale is designed for the assessment of preschoolers aged 2:0 through 5:11 with social and emotional troubles or significant behavioral concerns (LeBuffe & Naglieri, 1999a, 1999b, 2003). What makes the instrument unique is the noteworthy focus on protective factors that can buffer the impact of social, emotional, or behavior difficulties. DECA-C consists of three protective factor scales (Initiative, Self-control, and Attachment), as well as four problem scales (Attention Problems, Aggression, Withdrawal/Depression, and Emotional Control Problems). The measure can be completed by both parents and teachers. The response options for the 62 items require that the parent or teacher rate the frequency of various behaviors on a 5-point scale (never, rarely, occasionally, frequently, very frequently). When combined, the three protective factor scale scores provide a Total Protective Factors score that indicates possible sources of resilience for the child. These scales include:

• Initiative: Assesses the child’s ability to use independent thought and behavior to meet his or her needs. Items resemble “Retrieves things by himself or herself.”

• Self-control: Measures the child’s capacity to experience and express a range of emotions in a socially acceptable manner. Items resemble “Controls his or her temper.”

• Attachment: Assesses the child’s formation of strong and long-lasting relationships with parents, teachers, and family members. Items resemble: “Accepts adult comforting when upset.”

The DECA-C is based, in part, on resilience theory, as proposed by Werner (1990) and described by others (e.g., Masten, Best, & Garmezy, 1990). Resilience theory is a strengths-based approach that concentrates on protective factors at three levels: environmental (high-quality childcare and schools), family (nurturing parents and extended family), and within-child (adaptive personality traits). LeBuffe and Naglieri (1999b) summarize the essentials:

Children whose behavior reflects these protective factors tend to have positive outcomes despite stress and are often characterized as resilient. Children lacking or with underdeveloped protective factors are more likely to develop emotional and behavioral problems under similar risk conditions and are described as vulnerable (p. 75).

The purpose of appraising protective factors is so that interventions can build upon the child’s strengths. The focus on resilience provides a hopeful supplement to the usual, customary appraisal of problem areas. In addition to protective factors, the DECA-C also provides a well-conceived analysis of behavioral concerns. When combined, the four problem scales yield a Total Behavioral Concerns score that indicates the vulnerability of the child to social and emotional difficulties. These scales include:

• Attention Problems: Assesses the child’s ability to focus on a task and ignore distracting environmental stimuli. Items resemble: “Loses focus on the task at hand.”

• Aggression: Measures aggressive or destructive acts directed at other persons or things. Items resemble: “Destroys personal property of others.”

• Withdrawal/Depression: Assesses self-absorption and emotional/social withdrawal. Items resemble: “Appears wrapped up in his/her own world.”

• Emotional Control Problems: Measures difficulties in controlling negative emotions that interfere with goal directed behavior. Items resemble: “Loses temper when things don’t go his/her way.”

Standardization of the DECA-C is exemplary, based on 1,108 preschool-aged children rated by parents or teachers. The sample approximated national data for preschoolers with respect to race, ethnicity, geographic region, and family income. Internal consistency reliability with these samples was good. For the parents, coefficient alphas for the subscales were typically in the high .70s (median .78), whereas the values for teachers were higher, typically in the high .80s (median .88). Discriminant analysis with the Total Behavioral Concerns scale scores revealed a 74 percent accuracy in classifying clinical versus community cases, suggesting good criterion validity (LeBuffe & Naglieri, 1999b).

Several recent studies support the validity and utility of the DECA-C. Ogg et al. (2010) conducted a confirmatory factor analysis of scores for 1,344 children on the protective factors scales, and determined that the factor structure proposed by the original authors was adequate, with minor modifications in wording. Specifically, a few items revealed differential item functioning for boys versus girls, suggesting that minor adjustments to item wordings would strengthen their respective subscales. Jaberg, Dixon, and Weis (2009) replicated the original factor structure as well and found adequate internal consistency for the protective factors scales in a sample of 780 kindergarten children. Lien and Carlson (2009) favorably describe use of the instrument with Head Start populations.

Additional Measures of Infant Capacity

As we have learned, the assessment of infants can be vital, yet it is tricky. Infants ordinarily do not follow directions and they may not be able to verbalize what they know. Assessment is a huge challenge. Nonetheless, dozens of test developers have risen to the summons. Even a brief review of alternative instruments would be chapter-length. We refer the reader to the remarkable 400-page review provided by Berry, Bridges, and Zaslow (2004), which is available online at www.childtrends.org. This compendium provides thoughtful reviews of dozens of scales for learning, cognition, language, literacy, math, social-emotional, and Head Start outcomes.

7.2 ASSESSMENT OF PRESCHOOL INTELLIGENCE

Preschool children exhibit wide variability in emotional maturity and responsiveness to adults. One child may warm up to the examiner and strive for optimal performance on all questions. Another child may stare mutely at the floor rather than attempt a simple block design task.

For the first child, we can rest assured that the test results are an appropriate index of cognitive functioning. But for the second child, uncertainty prevails. Does the nonresponsiveness signal a lack of skill or a lack of cooperation? With preschool children, a large measure of humility is required of the examiner. Scarr (1981) has expressed this sentiment as follows:

Whenever one measures a child’s cognitive functioning, one is also measuring cooperation, attention, persistence, ability to sit still, and social responsiveness to an assessment situation.

The special danger in preschool assessment is that the examiner may infer that a low score indicates low cognitive functioning when, in truth, the child is merely unable to sit still, attend, cooperate, and so forth. Preschool assessment needs to be approached with unusual caution to avoid negative consequences of labeling and overdiagnosis of disabling conditions.

There are several individually administered intelligence tests suitable for preschool children. The most commonly used instruments include:

• Kaufman Assessment Battery for Children-2 (KABC-2)
• Differential Ability Scales-II (DAS-II)
• Wechsler Preschool and Primary Scale of Intelligence-IV (WPPSI-IV)
• Stanford-Binet Intelligence Scales for Early Childhood, Fifth Edition (Early SB5)

The KABC-2 was described in the previous chapter. We will focus here on the Differential Ability Scales-II, the WPPSI-IV, and the Early SB5.

Differential Ability Scales-II

The Differential Ability Scales-II (DAS-II) is the latest edition of a highly respected test initially published in 1990 (Elliott, 1990, 2007). The test consists of three batteries: The Early Years Battery (lower-level) for ages 2-6 to 3-5, the Early Years Battery (upper-level) for ages 3-6 to 6-11, and the School-Age Battery for ages 7-0 to 17-11. We focus here on the battery used with preschool children aged 3-6 to 6-11. The DAS-II includes 10 core subtests and 10 diagnostic subtests; however, rarely is a child administered all 20 subtests. The core subtests are the primary measures of cognitive abilities, whereas the diagnostic subtests provide supplementary information about school readiness and information processing. The particular combination of subtests administered depends on the child’s age, ability level, and the purposes of assessment. For preschool children age 3½ and above, a comprehensive test battery would include six core subtests and seven diagnostic subtests, which are described in Table 7.2.

The core subtests are heavily saturated with the g factor and are used to derive three core cluster scores (Verbal, Nonverbal Reasoning, and Spatial) and an overall composite score known as General Conceptual Ability (GCA). An optional cluster score known as the Special Nonverbal Composite (SNC) can be computed from four nonverbal subtests as well. In developing the DAS and its revision, Elliott (2007) steered away from concepts of intelligence and IQ, using the more neutral designation of GCA instead. Even so, most experts in the field would consider GCA to be essentially the same as IQ.

The diagnostic subtests measure early number concepts, phonological processing, short-term memory, and processing speed. These subtests and the diagnostic composites derived from them are used for clinical analysis only. The diagnostic subtests are less dependent on the g factor and therefore do not figure in the GCA or any core composites. The diagnostic subtests contribute to three diagnostic cluster scores (School Readiness, Working Memory, and Processing Speed). These subtests provide information useful in assessing learning problems and school readiness, thereby complementing the core subtests.

The DAS-II is normed to standard scores (M = 100, SD = 15) for the GCA and cluster scores, whereas the individual subtests are based on T scores (M = 50, SD = 10). The DAS-II was normed on 3,480 U.S. children, with careful stratification (2002 census data) on age, gender, race/ethnicity, parental education, and geographic region. The reliability of DAS-II scores is commendable for an instrument used at the preschool level. Typically, preschool children are easily distracted and plainly influenced by situational factors, which tends to lower the reliability of test scores. The DAS-II seems relatively immune to these influences. For preschoolers, GCA internal consistency reliability is reported to be .95. The cluster scores also show excellent reliability with values ranging from .89 to .95. Internal consistency reliability of the subtests is predictably lower, although still laudable, ranging from .81 to .91. As is often found in reliability studies, test–retest reliability figures were significantly lower, based on retesting of 369 children after a period ranging from 7 to 63 days. These coefficients ranged from .51 to .92, with most values in the .70s and .80s.

The validity of the DAS-II looks promising from several perspectives. First, the measure reveals very strong correlations with other tests of preschool cognitive functioning and achievement. For example, DAS-II GCA scores correlate strongly with mainstream intelligence tests (r = .87 with WPPSI-III IQ and r = .84 with WISC-IV IQ). Likewise, strong correlations are observed with major achievement tests (r = .82 with WIAT-II total achievement and r = .81 with KTEA-II total achievement). Another line of validity evidence for the DAS-II consists of test data for 12 special groups, including children with giftedness, mental retardation, reading disorder, ADHD and learning disorder, and limited English proficiency. In general, these groups reveal theory-consistent patterning of scores: those with reading disorders score relatively low on the Verbal Ability cluster, those with ADHD and learning disorder score relatively low on the School Readiness cluster, those known to be gifted earn average GCA scores of 125, and so forth.

TABLE 7.2 DAS-II Subtests on the Early-Years Battery, Upper Level

Subtest | Abilities Measured | Contribution to Composite(s)

Core Subtests
Verbal Comprehension | Receptive language, understanding of oral instructions | GCA, Verbal Ability
Naming Vocabulary | Expressive language, knowledge of names and objects | GCA, Verbal Ability
Picture Similarities | Nonverbal reasoning, matching pictures with common themes | GCA, Nonverbal Reasoning Ability
Matrices | Abstract reasoning, deducing the missing pattern in a matrix | GCA, Nonverbal Reasoning Ability
Pattern Construction | Nonverbal, spatial visualization with colored blocks and squares | GCA, Spatial Ability
Copying | Design copying, fine-motor coordination, visual-spatial matching | GCA, Spatial Ability

Diagnostic Subtests
Early Number Concepts | Knowledge of numerical concepts (number, order, addition, subtraction) | School Readiness
Matching Letter-Like Forms | Seeing spatial relationships, visually discriminating similar forms | School Readiness
Phonological Processing | Ability to process syllables, sounds, and phonemes, e.g., rhyming, blending | School Readiness
Recall of Sequential Order | Visualization and recall, e.g., order of body parts (belly, hair, toe, chin) | Working Memory
Recall of Digits Backward | Short-term auditory recall for sequences, mental manipulation | Working Memory
Speed of Information Processing | Rapid visual scanning and simple decision-making | Processing Speed
Rapid Naming | Naming colors and pictures as quickly as possible | Processing Speed

Note: GCA = General Conceptual Ability. Also, a Special Nonverbal Composite (SNC) can be computed from the four nonverbal core subtests.

Confirmatory factor analyses reported in the technical manual leave a confusing picture as to the underlying structure of the DAS-II. The number of factors providing the best fit to the test data differs by age group, ranging from a 2-factor solution for the youngest age group (age 2-6 to 3-5) to a 7-factor solution for children ages 6-0 to 12-11, with 5- and 6-factor models for other age groups. On the other hand, the DAS-II is not predicated on any particular model of intelligence, so the pertinence of confirmatory factor analyses is questionable.
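Because the DAS-II reports the GCA and cluster scores as standard scores (M = 100, SD = 15) and the individual subtests as T scores (M = 50, SD = 10), it can help to see where a score on one metric would sit on the other. The Python sketch below does this through z-scores. It assumes a simple linear rescaling for illustration, whereas actual scoring relies on the publisher's norm tables, and the function and variable names are hypothetical.

```python
# Mean and standard deviation of the two score metrics named in the text.
METRICS = {
    "standard": (100.0, 15.0),  # GCA and cluster scores
    "T": (50.0, 10.0),          # individual subtest scores
}


def to_z(score: float, metric: str) -> float:
    """Express a score as standard deviations above or below the mean."""
    mean, sd = METRICS[metric]
    return (score - mean) / sd


def convert(score: float, source: str, target: str) -> float:
    """Linearly rescale a score from one metric to another via its z-score."""
    mean, sd = METRICS[target]
    return mean + sd * to_z(score, source)


if __name__ == "__main__":
    # A subtest T score one SD below the mean, placed on the standard-score metric.
    print(convert(40, "T", "standard"))
    # The average GCA of 125 cited for gifted children, placed on the T metric.
    print(round(convert(125, "standard", "T"), 1))
```

Under this linear assumption, a T score of 40 and a standard score of 85 both lie one standard deviation below the mean, which is why profiles mixing the two metrics are usually compared in standard deviation units rather than raw score points.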

Even though the DAS-II has been available for a few years, there is almost no published research using the test. One study found the instrument valuable in the evaluation of specific learning disability (SLD). In particular, regression equations using the cluster scores were helpful in identifying children with SLD in mathematics (Hale, Fiorello, Dumont, and others, 2008). Beran (2007) reviews the test favorably, with this understatement: “The test is complex.” In fact, the summary page of the record form for hand scoring proves so difficult to follow that computer scoring is nearly mandatory. Sattler (2008) provides an especially thorough overview of the DAS-II.

Wechsler Preschool and Primary Scale of Intelligence-IV (WPPSI-IV)

The WPPSI-IV is a significant revision of its predecessor, the WPPSI-III, and continues a long tradition of excellence in the assessment of preschool and primary school children (Wechsler, 2012). The test is suitable for children ages 2½ to 7 years and 7 months, although a slightly different mix of subtests is used for younger children (ages 2:6 to 3:11) than for older children (ages 4:0 to 7:7). We discuss only the version for older children here. The full battery includes up to 13 subtests, but only 6 are needed to obtain a Full Scale IQ (FSIQ), although this is rarely the solitary goal of assessment. In most situations, examiners will find it indispensable to compare and contrast the various subcomponents of general intelligence, not just to get a FSIQ. For this more useful assessment, an additional 4 subtests are needed, for a total of 10 subtests, which is the most common WPPSI-IV battery. The final 3 subtests (for a total of 13 subtests) are needed only for special ancillary index scales discussed later. We begin our discussion in reference to the standard 10 subtests normally administered.

Based on factor analytic studies, clinical considerations, and a comprehensive review of the latest research on cognitive abilities, the developers of the WPPSI-IV concluded that five Primary Index Scales, each based on two subtests, are needed to capture the complexity of cognitive abilities in older children. The structure of the WPPSI-IV is outlined in Table 7.3.

One desirable feature of the new edition is the use of child-friendly and developmentally appropriate stimulus materials. For example, in the new subtest Zoo Locations, one part of the working memory composite, the child views one or more animal cards placed on a large zoo layout for a predetermined time, then works with an “empty” zoo to place each card in the correct location. Another example of adapting test materials to the needs of children is the use of an ink dauber (essentially a large felt-tip pen) rather than a pencil to indicate responses on processing speed subtests. This reduces the confounding of the subtest (a measure of processing speed) with fine motor demands (a measure of motor prowess).

TABLE 7.3 Primary Index Structure of the WPPSI-IV at Ages 4:0 to 7:7

Primary Index           Subtests Used
Verbal Comprehension    Information, Similarities
Visual Spatial          Block Design, Object Assembly
Fluid Reasoning         Matrix Reasoning, Picture Concepts
Working Memory          Picture Memory, Zoo Locations
Processing Speed        Bug Search, Cancellation

Note: The six subtests in boldface are used in the computation of Full Scale IQ.

The WPPSI-IV is a recent revision, so there is little independent research on its psychometric properties or clinical utility. However, the similarities of this instrument with other Wechsler tests suggest that it will be a mainstay of preschool and primary school assessment. In closing, we should mention that the test allows for the computation of four Ancillary Index Scales:

• Vocabulary Acquisition: 2 subtests, Receptive Vocabulary and Picture Naming.

• Nonverbal: 9 subtests with minimal verbal demand, including Block Design and Matrix Reasoning.

• General Ability: 8 subtests, mainly untimed, including Information, Similarities, and Matrix Reasoning.

• Cognitive Proficiency: 5 subtests, including Picture Memory, Cancellation, and Animal Coding.

These index scales can be useful in special circumstances such as the assessment of deaf children (Nonverbal battery), evaluation of bright children with slower processing (General Ability battery), and assessment of mental proficiency (Cognitive Proficiency battery). The Cognitive Proficiency battery includes measures of memory and speeded visual search.

Stanford-Binet Intelligence Scales for Early Childhood

Known informally as the Early SB5, the Stanford-Binet Intelligence Scales for Early Childhood (Roid, 2005) combine the subtests from the Stanford-Binet Intelligence Scales,

Fifth Edition (SB5) with a new Test Observation Checklist and a software-generated Parent Report. The subtests of the SB5 were described in the previous chapter. We focus here on the Test Observation Checklist (TOC), which summarizes essential information about child test-taking behaviors—in particular, behaviors that may have a stunning impact on test scores. The Early SB5 was developed for children ages 2 years to 7 years and 3 months. This is precisely the age range in which a child’s true level of functioning can be radically underestimated due to behavior problems such as distractibility, low frustration tolerance, or noncompliance. For example, many preschool children simply stop responding when subtest items become difficult—they may look down, or look away, or offer a comment on an unrelated topic. Noncompliant behavior of this nature is common; in fact, occasional refusals are reported in 41 percent of young children (Aylward & Carson, 2005). But a refusal can mean many things. Perhaps the child really doesn’t know the answer; or perhaps the child

knows the answer but is bored with testing, or afraid to hazard a guess, or simply distracted. The examiner will never know for sure, but there is a good chance that the true cognitive abilities of a noncompliant child will be underestimated. The purpose of the TOC is to provide a qualitative but highly structured format for describing a wide range of behaviors, including noncompliance, known to affect test performance. The test-taking behaviors listed on the TOC are divided into two groups: (1) Characteristics and (2) Specific Behaviors. The former are general traits most likely found in many situations, whereas the latter are specific behaviors actually observed during the testing session. The focus of the TOC is behaviors that negatively impact test performance. Many of the characteristics and behaviors are rated on a continuum, whereas others are categorical. The characteristics rated include (Aylward & Carson, 2005):

• Motor Skills—includes gross motor skills such as clumsiness and fine motor skills such as pencil dexterity.

• Activity Level—includes both excessive restlessness as well as underactivity in relation to child’s age.

• Attention/Distractibility—refers to age-inappropriate inattention, a need for redirection.

• Impulsivity—indicates the examiner saw fit to intervene, slow the child down.

• Language—includes articulation, receptive language, and expressive language.

The specific behaviors rated include (Aylward & Carson, 2005):

• Consistency in Performance—may indicate a haphazard approach to the test.

• Mood—includes specific behavioral indicators such as negative mood, tantrums, or crying.

• Frustration Tolerance—includes aggressiveness, refusal to participate.

• Change in Mental Set—includes noted tendencies toward rigidity of approach or perseveration.

• Motivation—includes disinterest or boredom and related behaviors.

• Fear of Failure—is qualitatively judged through inference and can be corroborated through parental report.

• Degree of Cooperativeness/Refusals—a crucial category because numerous refusals can lead to underestimating cognitive ability.

• Anxiety—includes excessive fearfulness, shyness, or need for parental presence.

• Need for Redirection—is noted when the child cannot stay on task and constantly needs reminders.

• Parental Behaviors—includes items such as parental reassurance, tacit approval for misbehavior, or giving verbal cues.

• Representativeness of Test Behaviors—is based on brief interview with parent(s), if present during testing.

The TOC helps the examiner identify problematic behaviors that may affect the validity of the test results. But this is not the only purpose of this instrument. In addition, the documentation of these behavior problems may prove helpful in the early detection of developmental difficulties such as learning disabilities, behavior problems, attentional difficulties, borderline cognitive function, and neuropsychological deficits (Aylward & Carson, 2005).

7.3 PRACTICAL UTILITY OF INFANT AND PRESCHOOL ASSESSMENT

The history of child assessment has shown time and again that, in general, test scores earned in the first year or two of life show minimal predictive validity. For example, in her review of infant intelligence testing, Goodman (1990) concludes:

If the successful prediction of adolescent and adult intelligence from early childhood scores is one of the great accomplishments of applied psychology, then the failure to predict intelligence from infancy to early childhood ranks as one of its greatest failures.

Given this dismal record of repeated failures of predictive validity, we must ask a difficult question: What is the purpose and practical utility of infant assessment? In fact, infant tests do have an important but limited role to play. We return to that issue after a review of predictive studies.

Predictive Validity of Infant and Preschool Tests

With heterogeneous samples of normal children, the general finding is that infant test scores correlate positively but unimpressively with childhood test scores (Goodman, 1990; McCall, 1979). A few studies are more optimistic in tone (e.g., Wilson, 1983), but most researchers agree with McCall’s (1976) conclusion:

Generally speaking, there is essentially no correlation between performance during the first six months of life with IQ score after age 5; the correlations are predominantly in the 0.20s for assessments made between 7 and 18 months of life when one is predicting IQ at 5–18 years; and it is not until 19–30 months that the infant test predicts later IQ in the range of 0.40–0.55.
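To put these coefficients in perspective, it can help to square them: the squared correlation is the proportion of variance in later IQ that is statistically shared with the infant score. The short Python sketch below is illustrative only. The band labels and the function name are hypothetical, and reading "the 0.20s" as 0.20 to 0.29 is an assumption rather than a figure reported by McCall.

```python
# Correlation ranges taken from the McCall (1976) passage quoted above.
# Treating "the 0.20s" as 0.20 to 0.29 is an assumption made for illustration.
CORRELATION_BANDS = {
    "tested at 7-18 months": (0.20, 0.29),
    "tested at 19-30 months": (0.40, 0.55),
}


def shared_variance(r: float) -> float:
    """Proportion of variance two measures share, given their correlation r."""
    return r * r


for label, (low, high) in CORRELATION_BANDS.items():
    print(f"{label}: {shared_variance(low):.0%} to {shared_variance(high):.0%} of variance")
# tested at 7-18 months: 4% to 8% of variance
# tested at 19-30 months: 16% to 30% of variance
```

On this reading, even the strongest infant-test correlations in the quoted range leave roughly 70 percent of the variance in later IQ unaccounted for.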

McCall (1979) reconfirmed his original conclusion in a later review, finding that the correlations between infant and school-age test scores do not exceed .40 until the subjects are at least 19 months of age for the initial testing.

The findings with preschool tests are somewhat more positive in tone. The correlation between preschool test results and later IQ is typically strong, significant, and meaningful. The simplest way to investigate this question is to measure the stability of IQ results in longitudinal studies. In Table 7.4, we have summarized the age-to-age stability of children’s IQ scores on the Stanford-Binet from the Fels Longitudinal Study, an early, classic follow-up investigation of children’s intellectual and emotional development (Sontag, Baker, & Nelson, 1958). The lowest correlation in this table is .43, and that is between IQ tested at age 4 and again at age 12. What stands out in the table is the robustness of the link between IQ in preschool and later childhood. The older the child at initial testing, the stronger the relationship with later IQ. In fact, the results suggest that IQ becomes reasonably stable, on average, by 8 years of age. Collectively, these findings confirm that infant tests generally have poor prognostic value, whereas preschool tests are moderately predictive of later intelligence. This brings us back to the question posed at the beginning of this section: What is the purpose and practical utility of infant assessment?

Practical Utility of Infant Scales

The most important and sound use of infant tests is in screening for developmental disabilities. Early detection of children at risk for mental retardation is vital because it provides for early intervention and, consequently, allows for improved outcomes later in life. Although existing infant tests are poor predictors of childhood and adult intelligence, an exception to this rule is encountered for infants who obtain very low scores on the Bayley test and other screening tests. For example, infants who score two or more standard deviations below the mean on the original Bayley (1969) and the Bayley-II (Bayley, 1993), particularly on the Mental Scale, reveal a high probability of meeting the criteria for mental retardation later in childhood (Goodman, Malizia, Durieux-Smith, MacMurray, & Bernard, 1990). There is no longitudinal research with the very recent Bayley-III (Bayley, 2005), but this test likely possesses good predictive validity for low scores as well.

TABLE 7.4 Stability of IQ from 3 to 12 Years of Age

Age at                        Age at Retesting
Initial
Testing      4     5     6     7     8     9    10    11    12
3         0.83  0.72  0.73  0.64  0.60  0.63  0.54  0.51  0.46
4               0.80  0.85  0.70  0.63  0.66  0.55  0.50  0.43
5                     0.87  0.83  0.79  0.80  0.70  0.63  0.62
6                           0.83  0.79  0.81  0.72  0.67  0.67
7                                 0.91  0.83  0.82  0.76  0.73
8                                       0.92  0.90  0.84  0.83
9                                             0.90  0.82  0.81
10                                                  0.90  0.88
11                                                        0.90

Source: Adapted with permission from Sontag, L. W., Baker, C., & Nelson, V. (1958). Mental growth and personality development: A longitudinal study. Monographs of the Society for Research in Child Development, 23 (Whole No. 68). Copyright © by The Society for Research in Child Development, Inc.

With at-risk children, the correlation between infant test scores and later childhood IQ is much stronger than for samples of normal children. The most consistent finding is that a very low score on an infant test, two or more standard deviations below the mean, accurately prognosticates mental retardation in childhood. For example, studies with the Denver Developmental Screening Test-Revised (since revised and published as the Denver-II) revealed a false-positive rate of only 5 to 11 percent, meaning that infants and preschoolers identified as at risk for mental retardation rarely achieve

normal range cognitive functioning in childhood (Frankenburg, 1985). Most studies with the Bayley test also conform to this pattern. For example, VanderVeer and Schweid (1974) found that 23 young children with mild, moderate, and severe mental retardation confirmed by the Bayley at ages 18 to 30 months continued to merit a diagnosis of mental retardation one to three years later. Although some of the children with moderate and severe mental retardation were functioning at a higher level (mild retardation), none of the children with initial mental retardation was normal at follow-up. In an ostensibly contradictory finding, Hack, Taylor, Drotar, and others (2005) reported that very low scores on the Bayley-II for low-birth-weight infants tested at 20 months of age did not strongly predict low scores on the K-ABC at age 8. These findings are cautionary, but not definitive, insofar as the K-ABC is not a good criterion for mental retardation.

Fagan Test of Infant Intelligence (FTII)

The infant tests discussed in this chapter could be described as traditional, in the sense that their methods are a natural outgrowth of the long sweep of individual intelligence tests reaching back to the early 1900s. But perhaps new approaches are needed with infants. Lewis has argued that traditional infant tests overlook early information processing behaviors, such as recognition memory and attentiveness to the environment, that might better predict childhood cognitive function (Lewis & Sullivan, 1985). In one study, simple visual habituation to a novel stimulus (measured by the duration of fixation) assessed at 3 months of age correlated .61 with the Bayley Mental score at 24 months of age (Lewis & Brooks-Gunn, 1981). Fagan and McGrath (1981) reported similar findings. In their study, infants first observed a picture of a baby’s face for a short period of time and were then shown the same picture alongside an unfamiliar picture (e.g., picture of a bald-headed man). The investigators kept careful track of which picture the infants looked at more. The logic of the procedure is simple: Staring mainly

at the new picture signifies that an infant recognizes the old picture; that is, an infant with good recognition memory prefers to look at something new. Preference for novelty—as measured by visual fixation time on the new picture—thus becomes an index of early recognition memory. Years later, the investigators administered the Peabody Picture Vocabulary Test (PPVT) to gauge early childhood intelligence. Infant recognition memory scores and early child PPVT scores correlated .37 at 4 years of age and .57 at 7 years of age. Infant cognitive measures would appear to be promising predictors of childhood intelligence (Fagan & Haiken-Vasen, 1997). Using the paradigm described previously, Fagan (1984) developed a new approach to infant assessment known as the Fagan Test of Infant Intelligence (FTII). The FTII assesses visual recognition memory using a 10-trial habituation format (Fagan & Shepherd, 1986). In each trial, a photograph of a face is shown to the infant, followed by paired presentation of the original face with either (1) a photograph of a similar but

new face or (2) a photograph of the original face in a different orientation. The amount of time spent looking at the new photograph is presumed to indicate the degree to which the infant has noticed that it is different from the original picture. The examiner observes the infant’s corneal reflections to determine a percent Novelty Preference, averaged across the 10 trials. The procedure shows very high interrater agreement (O’Neill, Jacobson, & Jacobson, 1994). A score of less than 53 percent for novelty preference identifies children who are at risk for later mental retardation. Validation studies of the FTII as a predictor of childhood intelligence and as a screening tool for mental retardation are mixed in outcome. With regard to the prediction of intelligence, FTII scores obtained at 7 to 9 months of age correlated only .32 with Stanford-Binet IQ at age 3 for a sample of 200 infants (DiLalla, Thompson, Plomin, and others, 1990). In another study, overall correlations between FTII scores obtained at 7 to 9 months of age and WPPSI-R IQ at age 5 were very low, about .2,

for two Norwegian samples of healthy children (Andersson, 1996). Tasbihsazan, Nettelbeck, and Kirby (2003) have identified a likely reason that FTII scores correlate weakly with later IQ, namely, the test may possess poor reliability. In particular, for healthy, not at-risk infants, the test–retest stability coefficients for percent Novelty Preference were .29 for 12 infants tested at 27 and 29 weeks, –.07 for 12 infants tested at 29 and 39 weeks, and –.17 for 13 infants tested at 39 and 52 weeks. These stability coefficients are not just low—they are indistinguishable from zero, which raises doubts as to the soundness of the FTII instrument. The FTII may perform better as a screening test than as a general predictor of childhood intelligence. With regard to screening infants at risk for developmental disability, Fagan, Singer, Montie, and Shepherd (1986) reported very positive findings in a study of 62 infants who experienced adverse factors such as premature birth or maternal diabetes. When evaluated at 3 years of age, eight children revealed cognitive delay (IQ ≤ 70), whereas 54 were considered

normal. The FTII, previously administered between 3 and 7 months of age, correctly detected 6 of the 8 children with delay (75 percent sensitivity) and correctly identified 49 of 54 normal children (91 percent specificity). However, not all FTII screening studies of at-risk infants are positive in tone. For example, McGrath, Wypij, Rappaport, Newburger, and Bellinger (2004) used FTII scores from 1 year of age to predict low IQ at age 8 in 100 at-risk infants and found poor sensitivity of 32 percent in detecting cognitive delay (IQ ≤ 85) but fair specificity of 80 percent. Yuan (2002) published Chinese norms for the FTII and found a strong concurrent validity coefficient of .72 for 73 infants tested with the Bayley-II. Further research is needed before we abandon traditional infant measures in favor of the Fagan test and similar measures.

7.4 SCREENING FOR SCHOOL READINESS

Screening for school readiness is a controversial practice. One concern expressed by some

parents is that results from screening tests might be used to delay entry into the school system, or to hold a child back a year. These are fateful decisions with the potential for long-term impact, either good or bad. Another concern is that children might be permanently labeled as slow learners or cognitively delayed. Underlying the entire controversy is the confounding complexity of definition. What is school readiness? Implicitly or explicitly, experts work from at least five different models when defining school readiness. Each model dictates a distinctive approach to assessment and intervention. Community Research Partners (2007) provide an excellent summary of the five approaches, which we paraphrase below:

• Maturationist Model: School readiness is a biological issue, a question of cognitive, psychomotor, and emotional maturation that stems directly from unfolding biological maturation. Because age is the best single indicator of biological maturation, some states use this viewpoint as a basis for defining school entry by age and not using readiness assessments.

• Environmental Model: In this view, school readiness is based on children’s acquisition of skills learned from early socialization experiences, especially with parents and family members, which vary from child to child. This model supports the inclusion of parental involvement in school readiness assessments.

• Constructivist Model: In this approach, advocates see readiness as the extent to which children can learn tasks by interacting not just with teachers, but also with more knowledgeable peers and adults. This model supports an inclusive approach (parents, teachers, other adults) in the assessment process.

• Cumulative-Skills Model: This model views school readiness as a matter of the extent to which children possess important prerequisite skills necessary for learning foundational subjects such as reading and math. Policies that require assessment of pre-academic skills upon entrance into kindergarten flow from this approach.

• Ecological Model: This is a holistic methodology that views school readiness as an interaction between developmental status and children’s environments. In other words, readiness does not reside within the child alone, but stems as well from an interaction with the readiness of families, communities, services, and preschool settings. Within this model, assessment for readiness is a complex, qualitative evaluation that involves the wider community.

In this section, we will survey a variety of screening tests, keeping in mind the complexity of the issues involved in preschool screening. Children with low intelligence are substantially at risk for school failure, which explains why individual intelligence tests play an important role in the evaluation of preschool children. But individual intelligence tests require a substantial commitment of time (up to two hours) and must be administered by carefully trained practitioners. For practical reasons, then, individual intelligence tests are not suitable as screening instruments. The ideal screening instrument is a short test that can be administered by teachers, school nurses, and other individuals who have received limited training in assessment. In addition, a sensible screening test is one that provides a cutoff score that is accurate in classifying children as normal or at risk.

In the context of screening tests, two kinds of errors can occur. Normal children who fail the test would be referred to as false-positive cases (because they are falsely classified as positive for potential disability). At-risk children who pass the test would be referred to as false-negative cases (because they are falsely classified as negative for potential disability). The reader must keep in mind that the purpose of screening is merely to identify children in need of additional evaluation, which means that false-positive cases will receive further evaluation. Hence, a false-positive misclassification rarely leads to undesirable consequences. However, false-negative cases typically do not receive further evaluation, so this kind of misclassification is potentially more serious because a needy child is deemed to be normal. Glascoe (1991) recommends that a useful instrument should yield a false-negative rate of less than 20 percent (meaning that 80 percent of truly at-risk children are flagged by the test) and an even lower false-positive rate of less than 10 percent (meaning that 90 percent of normal children pass the test).

Glascoe and Shapiro (2005) outline five common pitfalls of developmental and behavioral screening in infancy and early childhood:

• Waiting until the problem is observable. Some clinicians use a screening test only after the problem is manifest—a waste of time and effort.

• Ignoring screening results. Practitioners may adopt a “wait and see” outlook—early intervention is then pointlessly postponed.

• Relying on informal methods. Clinicians often employ their own informal methods—consequently, children in need of services go undetected.

• Using inappropriate tests. Some clinicians sparingly use long batteries instead of screening tests—as a result, children with disabilities are overlooked.

• Assuming services are limited or nonexistent. Practitioners often incorrectly assume that services are not available—consequently, they are reluctant to administer screening tests.

These pitfalls lead to two adverse outcomes: underdetection of developmental problems and delayed discovery of disabilities. In both cases, needy infants and children do not receive the services they need.

Qualities of a Good Preschool Screening Instrument

What are the qualities of a good preschool screening instrument? School readiness involves a number of broad areas, including motor, language, cognitive, social, and emotional functioning. Success in early schooling requires that children function at or near age-appropriate levels in all these areas. Thus, a useful screening tool must address at least a few of these prerequisite domains. In addition to appropriate coverage, other qualities are needed in a suitable preschool screening tool as well. For example, the Minnesota Interagency Developmental Screening Task Force—a leading advocacy group in preschool screening—has published extensive standards by which it recommends and approves screening instruments (www.health.state.mn.us). The following list of criteria is modeled loosely on their recommendations:

• The primary purpose is screening rather than assessment, diagnosis, or prediction of academic success.

• Screening is provided in most or all of these areas: motor, language, cognitive, social, and emotional functioning.

• Overall test–retest reliability coefficient is a minimum of .70, preferably higher.

• Concurrent validity against a comprehensive assessment is a minimum of .70, preferably higher.

• Sensitivity and specificity of “at risk” and “not at risk” classifications, respectively, are both at least .70.

• Practicality and ease of administration are built in, with testing time of 30 minutes or less.

• Cultural, ethnic, and linguistic sensitivity is evident, that is, the test accurately screens children from diverse cultures.

• Minimum expertise is required for administration, that is, the test is suitable for paraprofessionals to administer.

The Interagency Task Force further notes that social-emotional domains embedded within current screening instruments do not demonstrate sufficient reliability and validity to determine if a child needs further assessment. Thus, separate instruments may be required to determine if children are “at risk” for school failure due to social-emotional difficulties.
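The classification criteria above can be checked with simple arithmetic once a screening test has been compared against a later diagnosis. The Python sketch below is a minimal illustration, not a published scoring procedure. The class and function names are hypothetical, and the counts are those reported earlier in the chapter for the FTII by Fagan, Singer, Montie, and Shepherd (1986): 6 of 8 delayed children detected and 49 of 54 normal children correctly passed. The second check applies the error-rate limits attributed to Glascoe (1991), treating the false-negative rate as 1 minus sensitivity and the false-positive rate as 1 minus specificity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScreeningCounts:
    """Outcome counts for a screening test checked against a later diagnosis."""

    true_positives: int   # delayed children flagged by the screen
    false_negatives: int  # delayed children the screen missed
    true_negatives: int   # non-delayed children who passed the screen
    false_positives: int  # non-delayed children flagged by the screen

    @property
    def sensitivity(self) -> float:
        return self.true_positives / (self.true_positives + self.false_negatives)

    @property
    def specificity(self) -> float:
        return self.true_negatives / (self.true_negatives + self.false_positives)


def meets_task_force_floor(counts: ScreeningCounts, floor: float = 0.70) -> bool:
    """Sensitivity and specificity both at least .70, as in the criteria list."""
    return counts.sensitivity >= floor and counts.specificity >= floor


def meets_glascoe_1991(counts: ScreeningCounts) -> bool:
    """False-negative rate under 20 percent and false-positive rate under 10 percent."""
    false_negative_rate = 1 - counts.sensitivity
    false_positive_rate = 1 - counts.specificity
    return false_negative_rate < 0.20 and false_positive_rate < 0.10


# Counts from the FTII study of 62 at-risk infants described earlier.
fagan_1986 = ScreeningCounts(
    true_positives=6, false_negatives=2, true_negatives=49, false_positives=5
)

print(f"sensitivity = {fagan_1986.sensitivity:.2f}")  # sensitivity = 0.75
print(f"specificity = {fagan_1986.specificity:.2f}")  # specificity = 0.91
print(meets_task_force_floor(fagan_1986))  # True
print(meets_glascoe_1991(fagan_1986))      # False: 25 percent of delayed children were missed
```

The same counts can satisfy one standard and fail another, which is one reason a reported sensitivity or specificity should always be read against the specific threshold being applied.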

Instruments for Preschool Screening

As noted by Meisels and Atkins-Burnett (2005), dozens of instruments have been produced to screen for developmental delays, but only a few have withstood the test of time. In Table 7.5 we summarize a few recommended tools (Glascoe, 2005; Meisels & Atkins-Burnett, 2005). An interesting feature of these evaluations is that nearly all of them are available in multiple languages, including Spanish, French, Korean, Vietnamese, Laotian, Cambodian, Hmong (the language of the ethnic group from mountainous regions of southeast Asia), and Tagalog (the language of the Philippines). These tools reflect the increasing diversity of American culture and the desire to provide adequate school-based services to recent immigrants.

TABLE 7.5 A Sample of School Readiness Screening Tests

Ages and Stages Questionnaire (Brookes Publishing Company)
Birth to 60 months; parent report of language, cognition, personal-social, and motor skills; available in English, Spanish, French, and Korean; takes 10 to 20 minutes; clerical or paraprofessional tester.

Brigance Screens (Curriculum Associates)
Birth to 60 months; observation of social-emotional skills, speech-language, motor, readiness, and general knowledge; available in English, Spanish, Laotian, Vietnamese, Cambodian, and Tagalog; takes 15 to 20 minutes; consult online training module before scoring.

Early Screening Inventory-Revised (Pearson Assessments)
36 to 60 months; observation of visual motor/adaptive, language and cognition, and gross motor skills; available in English and Spanish; takes 15 to 20 minutes; screeners and scorers can be trained with a manual and video.

FirstSTEP Preschool Screening Tool (Pearson Assessment)
33 to 62 months; observation of cognitive, communication, and motor domains and classifications of: within acceptable limits, caution, or at-risk; available

We limit our discussion here to just three tests: the DIAL-4 (Developmental Indicators for the Assessment of Learning-4), the Denver II (a revision of the Denver Developmental Screening Test-Revised), and the HOME (Home Observation for the Measurement of the Environment). The first two tests use conventional approaches for the identification of developmental delay, whereas the third instrument, the HOME, embodies a radical departure from traditional procedures.

7.5 DIAL-4

The Developmental Indicators for the Assessment of Learning-4 is an individually administered test designed for the quick and efficient screening of developmental problems in preschool children ages 2:6 through 5:11 (Mardell & Goldenberg, 2011). The test screens for difficulties in five areas, including direct behavioral assessment of three major developmental domains: motor, concepts, and language. Items in these domains are administered directly to the child by the

examiner. Two additional domains (self-help and social-emotional) are appraised by means of questionnaires filled out by a parent (or both parents jointly) and a teacher. For children who have not yet entered kindergarten, the teacher form is filled out by a preschool teacher. If the child has not been to preschool, test results still are beneficial. Examples of items within the five domains include the following:

• Motor: Fine-motor items include block building, cutting, copying shapes and letters, name writing, and finger touching; gross-motor items include catching, jumping, hopping, and skipping.

• Concepts: Pointing to named body parts, naming or identifying colors, rote counting, counting blocks, positioning blocks, identifying concepts, and sorting shapes.

• Language: Giving personal information (name, age, sex), naming objects and actions, proper articulation, and phonemic awareness (e.g., rhyming).

• Self-Help: Parent and teacher fill out separate questionnaires with items relevant to the child’s personal care skills, such as eating, grooming, and dressing.

• Social-Emotional: Parent and teacher fill out separate questionnaires with items relevant to the child’s social skills with other children and parents, such as sharing, empathy, self-control, and rule compliance.

The DIAL-4 is available in both English and Spanish, although standardization is now based on the combined normative sample, that is, separate norms are not provided. The decision to develop unified norms was carefully considered during test development, and based on recognized requirements of school districts that serve substantial proportions of Spanish speaking/bilingual children of Hispanic origin. The large norm sample was obtained nationwide, roughly stratified by key demographics such as race and parental education. Because children are changing so quickly in preschool and early school years, norms are provided at two-month intervals. Scoring for some items is discrete and objective, whereas for other questions the scoring criteria

in the manual leave room for subjective interpretation, which detracts from the reliability of the instrument. A total score of direct academic relevance is obtained by summing the first three area scores (motor, concepts, language). The test yields a total of eight scaled scores (mean of 100, SD of 15). Table 7.6 depicts a 4-year-old boy with language delay and problems with social development. An interesting feature of this case is that the teacher perceives the boy as further behind than the parents do for both self-help and social development. This disparity might facilitate useful discussion in planning for academic intervention. In addition to the eight standard scores depicted here, the DIAL-4 provides a wealth of additional information such as raw scores, cut- off scores, and percentile ranks. A key feature of the test is that for each of the eight areas shown, the manual provides cutoff scores for assigning the child to one of two outcome groups labeled “potential delay” and “okay.” A finding of “potential delay” in one or more areas is a

starting point for further discussion, not a mandate for any high-stakes decision-making. The publisher offers computer scoring and generation of reports by means of a secure internet service known as Q-global. This yields a printout of results and a Report to Parents, which can be helpful in discussion of the child’s progress among parents, caregivers, school psychologists, and teachers. A short version of the test, cleverly called Speed Dial Screener, is available, which cuts the testing time of about 40 minutes in half. However, the trade-off of reducing testing time by decreasing the number of test items (which unavoidably diminishes scale reliability) may not be a prudent exchange.

TABLE 7.6 DIAL-4 Scaled Score Results for a 4-Year-old Boy with Language Delay and Social-emotional Problems

Respondent    Performance Area     Standard Score
Child         Motor                110
Child         Concepts              95
Child         Language              63
Child         Total                 89

Questionnaire Results
Parent        Self-Help            104
Parent        Social-emotional      77
Teacher       Self-Help             88
Teacher       Social-emotional      65

Independent research on the DIAL-4 is scant at this time. A search of PsycINFO for articles with DIAL-4 in the title did not yield a single hit. Even so, the latest release is only a minor departure from its predecessor; hence, reliability and validity evidence for the DIAL-3 buttresses the standing of the new edition. Reliability of the DIAL-3 is fair, given that it is a brief test for screening purposes. Internal consistency coefficients range from .66 for Motor to .84 for Concepts, with a total scale reliability of .87. Test–retest data are similar, which is to say, not up to the suggested minimum reliability of .90 for tests used to

make individual decisions (Nunnally & Bernstein, 1994). Validity of the instrument has been evaluated along the familiar lines of content, construct, and criterion-related validity. Content validity is judged to be high insofar as a panel of experts provided content reviews and helped eliminate inappropriate and biased items. Criterion-related validity is strong, as judged by correlations with similar instruments such as the Early Screening Profiles, Differential Ability Scales, and Peabody Picture Vocabulary Test-IV.

A recent study favorably evaluates the construct validity of the DIAL-3 through confirmatory factor analysis (Assel & Anthony, 2009). As noted, the instrument was designed to screen for developmental delays in three domains: motor abilities, conceptual knowledge, and language competence. An essential feature of the test is that separate scores are reported for each domain. These domains and the 21 subtests comprising them were rationally preconceived by the test authors. An important question is whether the 21 subtests “hang together” statistically in a manner that supports the rational grouping into the three domains provided by the test developers. In other words, do the three domains possess a latent reality, or are they merely figments of the imaginations of the test developers? Using test results for 1,560 children ages 3 to 6, Assel and Anthony (2009) found an excellent fit between the three domains traditionally reported on the DIAL-3 and three empirically derived domains found through factor analysis, which supports the construct validity of the test. However, these authors did note that the Articulation subtest was a poor index of language competence, and the Catching subtest was a poor index of motor abilities. Further, the authors found that Name Writing, Rapid Color Naming, and Letters/Sounds demonstrated floor effects; that is, even the easiest items on these subtests were failed by young, low-socioeconomic status, and minority children. These findings indicate the need for adding simpler items on these subtests for future revisions of the test. The DIAL-3 also comes in a Spanish version that is separately validated on

a sample of 588 Spanish-speaking Head Start children (Anthony & Assel, 2007). It is with regard to practical utility that the DIAL-4 and its previous editions have raised the greatest skepticism. The value of a screening test is best judged by the extent to which it accurately identifies children in need of further developmental assessment, and accurately identifies children who are normal as normal. One useful statistic is sensitivity, which is the proportion of confirmed problem cases accurately “flagged” as problem cases by a test (i.e., children with delay who are accurately classified as “potential” delay). Unfortunately, brief screening tests such as the DIAL-4 do not reveal strong sensitivity when the recommended cutoffs are used to identify children as showing “potential delay.” For example, sensitivity of the DIAL-4 is reported to be in the range of .73 to .82, depending on the target group being researched (Mardell & Goldenberg, 2011). Put another way, 18 to 27 percent of at-risk children will be missed.
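The arithmetic behind these figures can be sketched briefly. The minimal Python example below assumes only the definition given above (sensitivity is the proportion of confirmed problem cases that the screen flags). The function names and the cohort of 100 delayed children are hypothetical illustrations and are not part of the DIAL-4 materials.

```python
def sensitivity(true_positives, false_negatives):
    """Proportion of confirmed problem cases that the screen flags."""
    confirmed_cases = true_positives + false_negatives
    if confirmed_cases == 0:
        raise ValueError("need at least one confirmed problem case")
    return true_positives / confirmed_cases


def miss_rate(sens):
    """Proportion of confirmed problem cases that the screen fails to flag."""
    return 1.0 - sens


# Sensitivity range reported for the DIAL-4 (Mardell & Goldenberg, 2011).
reported_range = (0.73, 0.82)

# Hypothetical cohort size, chosen only to make the percentages concrete.
n_delayed = 100

for sens in reported_range:
    missed = round(miss_rate(sens) * n_delayed)
    print(f"sensitivity {sens:.2f}: about {missed} of {n_delayed} delayed children missed")
```

Under these assumptions, the lower bound of the reported range corresponds to roughly 27 missed children per 100 with a true delay, and the upper bound to roughly 18.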

Another useful statistic is specificity, which is the proportion of normal cases accurately identified as normal. For the DIAL-4, specificity is reported to be in the range of .82 to .86, depending on the scales and the comparison groups used (Mardell & Goldenberg, 2011). Stated in the converse, these data mean that 14 to 18 percent of the (sizable) samples of normal children initially will be flagged as “potential delay.” These false-positive identifications will cause anxiety for the parents and likely trigger the need for additional consultation and testing. The only way to achieve higher sensitivity is to liberalize the cutoff scores, that is, classify a larger proportion as showing “potential delay.” But for any single test at one point in time, sensitivity and specificity are inversely related. As one goes up, the other must go down. There is simply no way around this psychometric reality except to design a better, longer, and much more comprehensive test. But then the test becomes the gold standard for the thing being evaluated, and is no longer a screening test. In sum, increasing sensitivity inevitably will reduce specificity (percentage of normal children correctly identified as normal). This will cause many over-referrals (children identified as “potential delay” who actually are normal).

Denver II

The Denver II (Frankenburg, Dodds, Archer, and others, 1990) is an updated version of the highly popular Denver Developmental Screening Test-Revised (Frankenburg, 1985; Frankenburg & Dodds, 1967). The Denver test is probably the most widely known and researched pediatric screening tool in the United States. The instrument is popular worldwide; it has been translated into 44 different languages. Suitable for infants and children aged 1 month to 6 years, the test consists of 125 items in four areas: personal-social, fine motor-adaptive, language, and gross motor. The items are a mix of parent report, direct elicitation, and observation. Each item is arranged chronologically on the test by age of the child

and marked pass/fail. Testing begins at an age-appropriate level and continues until the child fails three items. Total time for evaluation is 20 minutes or less. Unlike other screening tests, the Denver II does not produce a developmental quotient or score. Instead, results on about 30 age-appropriate items provide an overall outcome that is interpreted as normal, questionable, or abnormal in reference to age-based norms. A category of “untestable” also is included. The standardization sample consisted of 2,096 children, all from the state of Colorado, stratified by age, race, and socioeconomic status.

Reliability of the Denver II is reported to be outstanding for a brief screening test. Interrater reliability among trained raters averaged .99. Test–retest reliability for total score over a 7- to 10-day interval averaged .90. The Denver possesses excellent content validity insofar as the behaviors tested are recognized by authorities in child development as important markers of development. However, the test interpretation categories (normal, questionable, abnormal) were based on clinical judgment and therefore await additional study for validation. A few initial studies raise significant concerns. Glascoe and Byrne (1993) evaluated 89 children in day care settings who were 7 to 70 months of age. Based on extensive independent evaluation, 18 of these 89 children were confirmed to have developmental delays according to federal definitions of disabling conditions (e.g., language delays, mental retardation, and autism). While the Denver II functioned well in correctly identifying 15 of the 18 at-risk children, the instrument performed poorly with the normal children. In fact, 38 of the 71 normal children failed the test and were classified as questionable or abnormal. Overall, 53 of the 89 children tested (about 60 percent) would be referred for additional assessment, and of those referred, only 15 (about 28 percent, roughly one in four) would have a true disability. The researchers recommend further validation study, with recalibration and possible discarding of some test items, before the test receives widespread use. Other reviewers are even more skeptical. For example, a blue-ribbon review

panel of the Minnesota Interagency Developmental Screening Task Force flatly concluded that the Denver-II is not suitable for developmental and social-emotional screening of preschool children (www.health.state.mn.us).

HOME

The Home Observation for Measurement of the Environment (HOME), popularly known as the HOME Inventory, is probably the most widely used index of children’s environment. Based on in-home observation and an interview with the primary caretaker, the instrument provides a measure of children’s physical and social environments. The HOME Inventory comes in three forms: Infant and Toddler, Early Childhood, and Middle Childhood. The latest editions of the instrument, dated 1984, emerged after 15 years of methodical revision and refinement (Caldwell & Richmond, 1967; Caldwell & Bradley, 1984, 1994).

Background and Description

Prior to the development of the HOME Inventory, the measurement of children’s environments was based largely upon demographic data such as parental education, occupation, income, and location of residence. Often these indices were combined into a cumulative measure referred to as social class or socioeconomic status. For example, Hollingshead and Redlich (1958) developed a continuum of social class derived from residence, occupation, and education of the head of the household. The SES score for a family whose household head worked at a clerical job, was a high school graduate, and lived in a middle-rank residential area would be computed as follows (Hollingshead & Redlich, 1958):

Factor        Scale Value ×    Factor Weight =    Partial Score
Residence     3                6                  18
Occupation    4                9                  36
Education     4                5                  20

Index of Socioeconomic Status = 74
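The computation in this worked example is a simple weighted sum. As a minimal sketch, the Python snippet below reproduces it; the scale values and factor weights are the ones shown above, whereas the function and variable names are illustrative assumptions.

```python
# Factor weights from the worked example above (Hollingshead & Redlich, 1958).
FACTOR_WEIGHTS = {"residence": 6, "occupation": 9, "education": 5}


def ses_index(scale_values):
    """Sum each factor's scale value multiplied by its factor weight."""
    return sum(FACTOR_WEIGHTS[factor] * value
               for factor, value in scale_values.items())


# Scale values for the family described in the text.
family = {"residence": 3, "occupation": 4, "education": 4}

print(ses_index(family))  # 18 + 36 + 20 = 74
```

Because occupation carries the largest weight, a one-step change on the occupation scale shifts the index more than a one-step change in residence or education.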

For research purposes, social scientists may categorize families into a fivefold hierarchy of social classes (classes I through V) based on the total score. The reader will notice that the Hollingshead and Redlich measure was derived entirely from status indices. The unstated assumption is that these indices reflect, indirectly, meaningful environmental variation. Put bluntly, proponents of SES as an environmental measure believe that, on average, children from a higher social class will experience a richer and more nurturant environment than children from a lower social class. In contrast to the SES approach, the HOME Inventory was developed to provide a direct process measure of children’s environments. The guiding philosophy of this instrument is that direct assessment of children’s experiences is a better index of the home environment than such indirect measures as parental occupation and education. Although it is true that social class—as embodied in occupation, education, residence—provides an oblique measure of

environmental richness, the authors of the HOME Inventory would argue that direct assessment of children’s experiences provides a more accurate index of variations in the home environment. Thus, assessment with the HOME involves, in part, direct observation of children’s home environments to determine whether certain types of crucial interactions and experiences are present or absent. For example, during an hour-long visit, the examiner observes whether the parent spontaneously communicates with the child at least five times, determines whether the child has at least 10 children’s books or story records, and assesses whether the neighborhood is esthetically pleasing according to detailed standards, to cite just a few examples. The purpose of the HOME Inventory is to measure the quality and quantity of stimulation and support for cognitive, social, and emotional development available to the child in the home. The scales and items of the HOME were derived from a list of environmental processes identified from existing research and theory as important

for optimal childhood development (Caldwell & Bradley, 1984). These growth-promoting processes include basic need gratification; frequent contact with a relatively small number of adults; a positive emotional climate that fosters trust of self and others; appropriate, varied, and patterned sensory input; consistency in the physical, verbal, and emotional responses of others; a minimum of social restrictions on exploratory and motor behavior; structure and order in the daily environment; provision and adult interpretation of varied cultural experiences; appropriate play materials and environment; contact with adults who value achievement; and the cumulative programming of experiences to match the child’s developmental level (Caldwell & Bradley, 1984). In brief, then, the purpose of the HOME is to measure specific, designated patterns of nurturance and stimulation available to children in the home. In order to complete the HOME Inventory, the examiner must observe the child and caregiver (usually the mother) interacting in the home

environment. Ratings for a few inventory items are derived from observation of the physical environment. In addition, completion of some items is based upon self-report of the caregiver. Items are dichotomously scored, 1 for present, 0 for absent. For example, one item asks whether the child is included in grocery store shopping at least once a week. The manual for the inventory encourages a relaxed, semistructured approach to observation and interview (Caldwell & Bradley, 1984). Completion of the inventory takes about an hour. The three forms of the HOME are Infant and Toddler (ages 0 to 3 years), Early Childhood (ages 3 to 6 years), and Middle Childhood (ages 6 to 10 years). The Infant and Toddler form consists of 45 items organized into the following six subscales:

• Emotional and Verbal Responsivity of Parent
• Acceptance of the Child’s Behavior
• Organization of the Environment
• Provision of Appropriate Play Materials
• Parent Involvement with Child
• Variety of Stimulation

The Early Childhood version consists of 55 items organized into eight subscales, whereas the Middle Childhood version consists of 59 items organized into eight subscales.

Technical Features

Relevant norms for the HOME Inventory are available from several sources. For the Infant and Toddler version, Caldwell and Bradley (1984) report subscale means and standard deviations for 174 families from Little Rock, Arkansas. Compared to the general population, this sample appears to overrepresent lower-SES families. For example, 34 percent of the families were on welfare and 29 percent were single-parent households. For the Early Childhood version, standardization data were available from 232 families in Little Rock, with lower-SES families similarly overrepresented. For the Middle Childhood version, Bradley and Rock (1985) report subscale means and standard deviations for 141 families from Little Rock. Approximately half of these families were

African American, the remainder Caucasian; boys and girls were sampled equally. These families were thought to be representative of all families rearing elementary-aged children in Little Rock, Arkansas. However, for all three versions it is clear that the standardization samples provide only local norms. These data may be useful as points of reference but should not be equated with a stratified, random, national sample. The reliability of the HOME Inventory has been demonstrated in a variety of ways, particularly for the Infant and Toddler version, which we discuss here. The authors note that short-term test–retest studies are inappropriate, since a respondent is quite likely to remember a specific answer given to a question, which would artificially inflate test–retest correlations (Bradley & Caldwell, 1984). Methods used for the assessment of reliability included interobserver agreement, internal consistency, and long-range test–retest stability coefficients for 91 families from the standardization sample. By definition, interobserver agreement for the

subscale items is reported to be 90 percent or higher, since this is the training criterion for new raters. Internal consistency estimates using Kuder-Richardson formula 20 ranged from .67 to .89 for all subscales except Variety of Stimulation, which yielded a coefficient of only .44. This rather low reliability coefficient was due to the small number of items in the subscale (five). Test–retest data were available from 91 families tested when their infant/toddler was 6, 12, and 24 months of age. The coefficients indicated a moderate to high degree of stability for the subscales, with most correlations in the .50s, .60s, and .70s. The correlation between total score for testings at 12 and 24 months of age was a highly respectable .77. The validity of the HOME Inventory has been bolstered by research findings that show modest correlations with SES indices. Because the inventory was proposed as a more meaningful, sensitive index of environment than social class, HOME scores should be significantly but not highly related to SES indices. For the Infant and Toddler version, HOME Inventory subscale

correlations with SES are mainly in the .30s and .40s, while the total score–SES correlation is .45 (Bradley, Rock, Caldwell, & Brisby, 1989). HOME scores also revealed a strong relationship with poverty status in Caucasian and minority samples (Bradley, Corwyn, Pipes McAdoo, & Garcia Coll, 2001). Furthermore, higher HOME scores predicted that children would exhibit fewer behavior problems and better preschool ability in a study of 93 single African American mothers (Jackson, Brooks-Gunn, Huang, & Glassman, 2000). HOME scores also show strong, theory-confirming relationships with appropriate external criteria, including language and cognitive development, school failure, therapeutic intervention, and mental retardation (Caldwell & Bradley, 1984). The correlations between HOME scores and intellectual measures such as the Stanford-Binet are particularly informative. In one study of 174 families, the total score on the HOME at 12 months of age correlated a robust r = .58 with Stanford-Binet IQ at 36 months of age. Factor-analytic studies of the HOME also support the construct validity of this instrument (Bradley, Mundfrom, Whiteside, and others, 1994). In sum, the HOME Inventory shows promise not only in research but also as a practical adjunct to intervention.

TOPIC 7B Testing Persons with Disabilities

7.6 Origins of Tests for Special Populations
7.7 Nonlanguage Tests
7.8 Nonreading and Motor-Reduced Tests
Case Exhibit 7.1 The Challenge of Assessment in Cerebral Palsy
7.9 Testing Persons with Visual Impairments
7.10 Testing Individuals Who Are Deaf or Hard of Hearing
7.11 Assessment of Adaptive Behavior in Intellectual Disability
7.12 Assessment of Autism Spectrum Disorders

In this topic we discuss instruments designed for exceptional and difficult consultations, such as those involving persons with sensory/motor impairment, recent immigrants from non-English-speaking countries, and individuals with significant intellectual deficiencies. According to the U.S. Census Bureau, about 32 million Americans over the age of 5 (one in eight) have a sensory, physical, mental, or self-care disability (www.census.gov, 2000). This estimate does not include persons living in institutions. In these extraordinary circumstances (evaluating persons with sensory, motor, language, or intellectual disability), specialized tests are needed for valid assessment. However, before introducing specific instruments, we examine a background issue: How did these instruments arise?

7.6 ORIGINS OF TESTS FOR SPECIAL POPULATIONS

Beginning in the 1950s, a renewed commitment to the needs and rights of physically and mentally disabled persons arose in the United States (Maloney & Ward, 1979; Patton, Payne, & Beirne-Smith, 1986). Societal attitudes

toward those with special needs shifted from outright disdain to a more supportive stance that favored new programs and initiatives on behalf of the disabled. Progress has been slow, but we are no longer surprised to see bathroom facilities with wheelchair access for persons with physical disability, large-print books for persons with visual impairments, or closed-captioned television programs for persons with hearing disabilities. Furthermore, the special needs of citizens with mental retardation are increasingly served by small community care facilities instead of massive, impersonal institutions.

In the early 1970s, the renewed concern for the needs of disabled persons was translated into federal legislation. In 1973, Public Law 93-112 was passed, serving as a “Bill of Rights” for individuals with disabilities. This legislation outlawed discrimination on the basis of disability. Two years later, the landmark Education for All Handicapped Children Act (Public Law 94-142) was enacted. This legislation mandated that disabled schoolchildren receive appropriate assessment and educational opportunities. In particular, psychologists were directed to assess children in all areas of possible disability (mental, behavioral, and physical) and to use instruments validated for those express purposes. We turn now to a review of tests that can be used for the assessment of persons with sensory, motor, or mental disabilities.

7.7 NONLANGUAGE TESTS

Nonlanguage tests require little or no written or spoken language from examiner or examinee. Thus, they are particularly suited for assessment of non-English-speaking persons, referrals with speech impairments, and examinees with weak language skills. These instruments can also be used as supplementary tests for examinees who have no disabilities.

Leiter International Performance Scale-Revised

The Leiter International Performance Scale-Revised (LIPS-R; Roid & Miller, 1997) is a revision of a classic and highly praised test of

nonverbal intelligence and cognitive abilities (Leiter, 1948, 1979). Leiter devised an experimental edition of the test in 1929 to assess the intelligence of those with hearing or speech impairment, those who were bilingual, or non-English-speaking examinees. The scale was field-tested with several ethnic groups in Hawaii, including children of Japanese and Chinese descent. The first edition was based on test results for American children, high school students, and World War II Army recruits. Although highly praised and widely used after its initial release, this test received strong criticism in recent years because of poor illustrations and outdated norms. The revised Leiter answers all criticisms handily, and the LIPS-R deserves wide use as a culture-reduced measure of nonverbal intelligence.

A remarkable feature of the Leiter is the complete elimination of verbal instructions. The Leiter-R does not require a single spoken word from the examiner or the examinee. With an age range of 2 years to 20 years and 11 months, the Leiter-R is particularly suitable for children and adolescents whose English language skills are weak. This includes children who are non-English-speaking, who have autism, traumatic brain injury, speech impairment, or hearing problems, or who come from an impoverished environment. The test is also useful in the assessment of attentional problems, as described below. Testing is performed by the child or adolescent matching small laminated cards underneath corresponding illustrations on an easel display (Figure 7.1). The test is untimed. Because the initial items are transparently obvious, most examinees catch on quickly without need of pantomime demonstration.

The Leiter-R contains 20 subtests organized into two batteries: Visualization and Reasoning, and Memory and Attention. The 10 subtests of the Visualization and Reasoning Battery are described in Table 7.7. Not all subtests are administered to every child. For example, the figure rotation subtest is too difficult for 2-year-olds and the immediate recognition subtest is too easy for adolescent examinees. The four Reasoning subtests include classification and design analogies. The six Visualization subtests include matching, figure-ground, paper folding, and figure rotation. The eight Memory subtests include memory span, spatial memory, associative memory, and delayed recognition memory. The two Attention subtests consist of an underlining test (e.g., marking all squares printed on a page full of geometric shapes) and a measure of divided attention (e.g., observing a moving display and simultaneously sorting cards correctly).

FIGURE 7.1 A Characteristic Item from the Leiter International Performance Scale-Revised

The Leiter-R yields a composite IQ with the familiar mean of 100 and standard deviation of 15. The test also produces subtest scaled scores with a mean of 10 and standard deviation of 3, as well as a variety of composite scores useful in clinical diagnosis. The test was normed on over 2,000 children and adolescents, from 2 to 21 years of age. Using 1993 census statistics, these subjects were carefully stratified according to race, age, gender, social class, and geographic region. Internal consistency reliability for subtests, domain scores, and IQ scores is excellent. Typical coefficient alphas are in the high .80s for subtests and the low .90s for domain scores and IQ scores. Extensive studies of item bias reveal that the items appear to function similarly in separate racial groups (white, African American, and Hispanic samples); that is, there is no evidence of bias (defined as differential item functioning). Coupled with the fact that the test is completely nonverbal, the absence of test bias indicates that the Leiter-R is a good choice for culture-reduced testing of minority children. But the test is useful in a wide range of other situations as well. For example, Hanzel (2003) recommends the Leiter-R for the evaluation of children with autistic disorder, a syndrome discussed later in the chapter.
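Because the composite IQ and the subtest scaled scores are expressed on different scales, it can help to place both on a common z-score metric. The sketch below assumes only the means and standard deviations stated above (100 and 15 for the IQ; 10 and 3 for subtest scaled scores). The function names and the example scores are hypothetical, and the percentile step additionally assumes that scores are normally distributed, which is an idealization rather than a property claimed in the test manual.

```python
from statistics import NormalDist


def z_score(score, mean, sd):
    """Distance of a score from the mean, in standard deviation units."""
    return (score - mean) / sd


def rescale(score, from_mean, from_sd, to_mean, to_sd):
    """Express a score from one standard-score scale on another scale."""
    return to_mean + z_score(score, from_mean, from_sd) * to_sd


def percentile_rank(score, mean, sd):
    """Percentile rank under an assumed normal distribution of scores."""
    return 100.0 * NormalDist(mean, sd).cdf(score)


# Hypothetical subtest scaled score of 7 (mean 10, SD 3) on the IQ metric.
iq_equivalent = rescale(7, 10, 3, 100, 15)
print(iq_equivalent)  # 85.0, one standard deviation below the mean

# Approximate percentile rank of that score under the normality assumption.
print(round(percentile_rank(iq_equivalent, 100, 15)))  # 16
```

The same conversion applies to any pair of standard-score scales, since only the mean and standard deviation differ between them.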

TABLE 7.7 Visualization and Reasoning Subtests of the Leiter-R

1. Figure Ground: Identification of designs or figures embedded within a stimulus. (All ages)
2. Design Analogies: Like the matrix analogies subtests found on many cognitive tests. (Ages 6 to 20)
3. Form Completion: Ability to recognize objects from fragmented line drawings. (All ages)
4. Matching: Matching and discrimination of simple visual stimuli. (Ages 2 to 10)
5. Sequential Order: Logical progression of pictorial or figural items. (All ages)
6. Repeated Patterns: Identify the missing part of a repeated pattern of figural items. (All ages)
7. Picture Context: Using visual cues to identify a pictured object that has been removed. (Ages 2 to 5)
8. Classification: Categorization of objects or geometric designs. (Ages 2 to 5)
9. Paper Folding: Ability to mentally “fold” an item shown in unfolded two-dimensional form. (Ages 6 to 20)
10. Figure Rotation: Capacity to mentally rotate a two- or three-dimensional object. (Ages 11 to

Empirical research with the Leiter-R is largely supportive at this time. The test has been shown to have utility in the assessment of medically fragile children (Hooper, Hatton, Baranek, Roberts, & Bailey, 2000), the assessment of low-functioning children with autism (Tsatsanis, Dartnall, Cicchetti, and others, 2003), and the evaluation of children classified as language impaired (Farrell & Phelps, 2000). In this latter study, the Leiter-R also demonstrated a validity-confirming correlation of r = .80 with another nonverbal measure of intelligence. Further, in testing with ethnic minorities, the Leiter-R appears to avoid the confounding of intellectual assessment with English language proficiency that is common with other tests. For example, one study of 47 Spanish-speaking and 47 English-speaking children reported average WISC-III IQs of 94 versus 88, respectively, whereas average Leiter-R IQs were nearly identical, 98 versus 99 (Cathers-Schiffman & Thompson, 2007).

The Leiter-R is a welcome revision of an obsolete test. In the hands of a careful clinician, the test is helpful in the intellectual assessment of children with weak skills in English. Other uses for the revised test include the assessment of attention-deficit/hyperactivity disorder (comparisons of the Attention subtests with the other domains are crucial here) and the evaluation of giftedness in young children (the extremely high ceiling of the test proves invaluable for this application). Whereas reviewers warned against using the original Leiter for placement or decision-making purposes (Sattler, 1988; Salvia & Ysseldyke, 1991), the revised Leiter is a huge improvement with regard to psychometric quality and standardization excellence. Thorough reviews of the Leiter-R and other nonverbal assessment instruments are provided by McCallum, Bracken, and Wasserman (2001).

Human Figure Drawing Tests

Most children enjoy drawing human figures and do so routinely and spontaneously. Since the early 1900s, psychologists have tried to tap into this almost instinctive behavior as a basis for

measuring intellectual development. The first person to use human figure drawing (HFD) as a standardized intelligence test was Florence Goodenough (1926). Her test, known as the Draw-A-Man test, was revised by Harris (1963) and renamed the Goodenough-Harris Drawing Test. More recently, the HFD technique has been adapted by Naglieri (1988). We should also mention that human figure drawings are widely used as measures of emotional adjustment, but we do not discuss that application here. The Goodenough-Harris Drawing Test is a brief, nonverbal test of intelligence that can be administered individually or in a group. Goodenough (1926) published the first edition of this test, while Harris (1963) provided important refinements in scoring and standardization, including the use of a deviation IQ. Strictly speaking, the Goodenough-Harris test doesn’t fit the criteria for nonlanguage tests insofar as the examiner must convey certain instructions in English or through a translator. However, the instructions are brief and basic (“I

want you to draw a picture of a man [or woman]; make the very best picture you can”). The Goodenough-Harris test is, for all practical purposes, a nonlanguage test. The purpose of the Goodenough-Harris Drawing Test is to measure intellectual maturity, not artistic skill. Thus, the scoring guide emphasizes accuracy of observation and the development of conceptual thinking. The child receives credit for including body parts and details, as well as for providing perspective, realistic proportion, and implied freedom of movement. The 73 scorable items are transformed to a scaled score with the familiar mean of 100 and standard deviation of 15. Of course, these norms, developed in the 1960s, are now thoroughly outdated. Even so, a large body of research confirmed that the test captured something important. For example, Frederickson (1985) reported correlations between Goodenough-Harris Drawing Test scores and WPPSI Full Scale IQ in the range of .72 to .80. In several other studies, correlations

with individual IQ tests are more variable, but the majority are over .50 (Abell, Briesen, & Watz, 1996; Anastasi, 1975). In response to criticisms of the Goodenough-Harris Drawing Test, Naglieri (1988) developed a quantitative scoring system and renormed the human figure drawing procedure. His scoring system, The Draw A Person: A Quantitative Scoring System (DAP), was normed on a sample of 2,622 individuals ages 5 through 17 years who were representative of the 1980 U.S. Census data on age, sex, race, geographic region, ethnic group, social class, and community size. The DAP yields standard scores with the familiar mean of 100 and standard deviation of 15. In a study of 61 subjects ages 6 to 16 years, the DAP correlated .51 with WISC-R IQ and produced similar overall scores, with a mean IQ of 100 versus mean DAP score of 95 (Wisniewski & Naglieri, 1989). Lassiter and Bardos (1995) found that the DAP score underestimated IQ scores obtained from the WPPSI-R and the K-BIT in a sample of 50 kindergartners and first graders.

Reviewers praise the DAP for its clear scoring system, strong reliability, and careful standardization (Cosden, 1992). However, results of validity studies are more cautionary. Harrison and Schock (1994) note that the accumulated evidence with HFD tests indicates low to moderate predictive validity. In spite of their popularity and appeal, HFD tests do not effectively identify children with learning difficulties or developmental disabilities, and they may not be valid for use even as screening measures.

Hiskey-Nebraska Test of Learning Aptitude

The Hiskey-Nebraska Test of Learning Aptitude (H-NTLA) is a nonlanguage performance scale for use with children aged 3 to 17 years (Hiskey, 1966). This test can be administered entirely through pantomime and requires no verbal response from the examinee. However, verbal instructions can be used with children who have normal hearing or mild hearing impairment. The H-NTLA consists of 12 subtests:

Bead Patterns
Block Patterns
Memory for Color
Completion of Drawings
Picture Identification
Memory for Digits
Picture Association
Puzzle Blocks
Paper Folding
Picture Analogies
Visual Attention Span
Spatial Reasoning

Raw scores on the subtests are converted into a Deviation Learning Quotient (LQ) with mean of 100 and standard deviation of 16. For a sample of 43 hearing-impaired children, the test–retest stability of the LQ scores was reported to be .79, .85, and .62 after intervals of about 1 year, 3 years, and 5 years, respectively, which is similar to data for normal children (Watson, 1983). Even so, more than one third of the sample showed a 15-point or greater change in scores over the 5-year time span, which demonstrates the importance of basing important decisions on more than a single measure. H-NTLA scores correlate quite robustly with achievement scales for grades 2 through 12 (median r = .49) and also with WISC-R Performance IQ (r = .85). Although the LQ yields average scores that are remarkably close to WISC-R Performance IQ for samples of children with hearing impairment and those who are deaf, the H-NTLA scores are substantially more variable (Phelps & Ensor, 1986). Thus, use of the H-NTLA may increase the risk of false-positive misclassification, that is, labeling children as gifted when they are only bright or as having mental retardation when they are merely borderline.

The H-NTLA is useful with children who are deaf, have speech or language impairments or mental retardation, or those who are bilingual. An interesting feature of this test is the development of parallel norms: The H-NTLA was standardized on 1,079 children who were deaf and 1,074 normal-hearing children aged 2½ to 17½. However, the chief weakness of the instrument is the inadequacy of these norms. For example, the representativeness of the sample of those who were deaf (picked on an opportunistic basis from schools for those who are deaf) is largely unknown. Standardization of the normal-hearing sample was based on occupational level of parents according to the

1960 U.S. Census. A contemporary and more detailed restandardization of the test would be quite helpful. Qu (1997) reports favorably on the reliability and validity of the test with huge samples of Chinese deaf children.

Test of Nonverbal Intelligence-4 (TONI-4)

The Test of Nonverbal Intelligence-4 (TONI-4) is a language-free measure of cognitive ability designed for disabled and language-impaired populations (Brown, Sherbenou, & Johnsen, 2010). By adding new items, the fourth edition realized a higher ceiling and a lower floor than the previous version. This is a pragmatic, brief, and simple measure that can be administered in 15 to 20 minutes. Because the response format can include any simple gesture such as nodding or pointing, the TONI-4 is well suited for persons who are deaf, language impaired, or physically limited. The authors recommend the test for assessing persons with aphasia, non-English speakers, and persons who have experienced a variety of severe neurological traumas.

The test instructions are pantomimed by the examiner and the examinee answers by pointing to one of six possible responses. For motorically impaired patients, the examiner can point to the alternatives, one by one, while awaiting a choice from the examinee (e.g., nod of the head, or even an eye blink from a paralyzed patient).

The TONI-4 comes in two equivalent forms (A and B). Each form consists of 60 abstract or figural items that do not include pictures or cultural symbols. Except for a few simple-matching items, the TONI-4 items require the examinee to solve problems by identifying relationships among the abstract figures. Many of the items are similar in format to those found on Raven’s Progressive Matrices. The test yields three kinds of scores: age equivalents (for younger examinees), percentile ranks, and TONI-4 quotients (mean of 100 and standard deviation of 15). The test is suitable for persons aged 6:0 through 89:11; the standardization sample consisted of 2,272 people from 33 states stratified according to gender, race and ethnicity, parental education, and socioeconomic status.

Reliability data are satisfactory, with internal consistency coefficients typically exceeding .90 and alternate-forms reliability in the range of .80 to .95. Independent validity studies of the TONI-4 are scant, but investigation of prior editions (which are highly similar in content) is supportive of this test as a culture-reduced index of general intelligence.

Overall, the TONI-4 is highly regarded as a brief nonlanguage screening tool for persons with impaired language abilities (e.g., aphasic, deaf, non-English-speaking, intellectually disabled). The test is more carefully standardized than most and possesses excellent reliability. A useful feature is that the untimed administration of TONI-4 rarely exceeds 20 minutes. Instructions are available in seven major foreign languages. For a review, see Ritter, Kilinc, Navruz, and Bae (2011).

7.8 NONREADING AND MOTOR-REDUCED TESTS

Nonreading tests are designed for illiterate examinees who can, nonetheless, understand spoken English well enough to follow oral instructions. Nonreading tests of intelligence are well suited to young children, illiterate examinees, and persons with speech or expressive-language impairments. These tests need not be specialized or esoteric: The performance subtests of most mainstream instruments qualify as nonreading tests. For example, examiners may use the WISC-III performance subtests to estimate the intelligence of examinees with language disabilities. However, clients with cerebral palsy or other orthopedically impairing conditions will score very poorly on nonreading tests that require manipulatory responses. Obtaining valid test results from such persons can present an enormous challenge (Case Exhibit 7.1). The motor deficits, increased tendency to fatigue, and inexactness of purposive movements common to persons with cerebral palsy will negatively affect their performance on cognitive assessment tools. Orthopedically impaired

clients need tests that are both nonreading and motor reduced. In particular, tests that permit a simple pointing response are well suited to the assessment of children and adults with cerebral palsy or other motor-impairing conditions.

CASE EXHIBIT 7.1
The Challenge of Assessment in Cerebral Palsy

The challenges inherent to special consultations are well typified by a client with cerebral palsy recently tested by a consulting psychologist. The young examinee was totally confined to a battery-powered wheelchair, except when a live-in attendant would transfer him to a bed or chair. Even a dispassionate observer would have to agree that the client didn’t look very capable, sitting hunched over in his chair, unable to control his drooling, one arm arched out at an awkward angle. Yet, in spite of his disability, he had achieved a fair degree of personal independence. Using a simple joystick control device, he could guide his wheelchair to the grocery store, library, and community center

where he would complete simple transactions by pointing to appropriate words and phrases in a plastic-bound spiral notebook. Because of his poor motor control, interactions with this client took quite a long time. Nonetheless, he was very efficient with short communications. Here is a typical exchange, with the client’s notebook-designated responses shown in capital letters:

“I understand you have a new synthesized-voice communication box, how do you like it?”
YOU ASKED TWO QUESTIONS.
“You’re right. I’ll bet that happens a lot. Do you have a communication box?”
YES.
“What do you think of it?”
IT’S NOT EASY.
“Now that we are done testing, should I find your driver?”
NO, I’LL WAIT. HE IS COMING BACK.

How intelligent is this client? What is his level of verbal comprehension? How well does he understand abstract concepts? For example, is he capable of understanding the essentials of microcomputer usage such as data entry, file storage, and directory commands? Could he learn to program a microcomputer? These are precisely the referral questions asked by a vocational rehabilitation counselor who was contemplating huge expenditures (thousands of dollars) to purchase a computer system for this disabled client.

Certainly, it would be easy to underestimate the potential of this young man with severe motor and language disabilities because, in a quite literal sense, his intelligence was hidden away, trapped inside his incapacitated body. The task of the examiner was to find the able mind inside the disabled body, a formidable challenge indeed. Using the Test of Nonverbal Intelligence and the Peabody Picture Vocabulary Test, the examiner determined that the young client possessed at least average intelligence and could likely learn the fundamentals of data processing with microcomputers.

Peabody Picture Vocabulary Test-IV

The Peabody Picture Vocabulary Test-IV (PPVT-4) is the best known and most widely used of the nonreading, motor-reduced tests (Dunn & Dunn, 1998). The PPVT-4 is used to obtain a rapid measure of listening vocabulary with persons who are deaf or who have neurological or speech impairments. Although the PPVT-4 is useful with any examinee who cannot verbalize well, the test is especially useful with examinees who also manifest motor-impairing conditions such as cerebral palsy or stroke. The PPVT-4 comes in two parallel versions, each consisting of 4 practice plates and 228 testing plates. Each plate contains four line drawings of objects or everyday scenes. The examiner presents a plate, states the stimulus word orally, and asks the examinee to point to the one picture that best depicts the stated word. The test items are precisely ordered according to difficulty level, arranged in 19 sets of 12 items each for efficient identification of basal and ceiling levels. The entry level is determined by age, and examinees continue until they reach

their ceiling level. Although the test is untimed, administration seldom exceeds 15 minutes. Raw scores are converted to age equivalents or standard scores (mean of 100, standard deviation of 15). The PPVT-4 was standardized on a representative national sample of 3,540 individuals ranging from 2½ to 90 or more years of age. Reliability data for the new edition are exceptionally strong, with typical internal consistency coefficients of .94, alternate-forms reliabilities of .89, and test–retest correlations of .93. Concurrent validity studies are also highly supportive, demonstrating robust correlations with verbal measures. For example, the test developers report a correlation of .7 with scores on the latest edition of the Clinical Evaluation of Language Fundamentals (CELF-4). The test developers of the PPVT-4 took great care to minimize and balance cultural influences in the test items. Independent consultants representing the perspectives of African Americans, Asians, Hispanics, Native Americans, and women reviewed the content

and artwork of the test during development, and adjustments were made following these reviews. The test items demonstrate attractive artwork that is balanced for racial and gender differences, including persons with physical disabilities. However, based on research with prior editions, the evidence is mixed as to whether the Peabody is a culturally fair instrument that serves as a valid measure with minority children. For example, Washington and Craig (1999) found that 59 African American preschoolers at risk for academic failure averaged 91 on the test (SD of 11), which was seen as commensurate with their environmental disadvantages. These authors laud the test as “culturally fair.” However, Campbell, Bell, and Keith (2001) reported an average score of 82 (SD of 12) for 416 African American children of low socioeconomic status, which was 8 points lower than their overall score on the K-ABC. These researchers concluded: “Despite the attempts to reduce racial differences, the PPVT-III appears to perform similarly to prior editions of the Peabody scales. On average, the PPVT-III

tends to underestimate both intellectual ability and scholastic achievement, as measured by the K-ABC, in low SES, African American children” (p. 91). Further research will be needed to clarify the utility of this test with minority children. Several lines of evidence support the validity of the Peabody test, but only as a narrow measure of vocabulary, not as a general measure of intelligence (Altepeter & Johnson, 1989). Dunn and Dunn (1981) sought to ensure content validity by searching Webster’s New Collegiate Dictionary for all words whose meanings could be represented by a picture. Thus, the authors had a specific content universe in mind, and the items from the Peabody appear to be a fair sampling from this domain. In addition, the authors used sophisticated item-selection techniques based on the Rasch-Wright latent-trait model to help build construct validity into the test. This model enables researchers to construct a growth curve for the latent trait being measured (hearing vocabulary) and to select items that best fit the curve. Using tryout

and calibration data, the curve was drawn repeatedly on a computer. If an item did not fit the Rasch-Wright latent-trait model (too flat or too steep an item-characteristic curve) it was discarded from consideration. Concurrent and predictive validity data for the Peabody are somewhat limited but promising. Several investigators have correlated the PPVT-R with achievement measures, where modest relationships (r’s from .30 to .60) are common (Naglieri, 1981; Naglieri & Pfeiffer, 1983). Correlations with reading achievement tend to be higher than with spelling and arithmetic achievement, suggesting that the PPVT-R has appropriate discriminant validity (Vance, Kitson, & Singer, 1985). Several investigators have correlated earlier versions of the Peabody with intelligence measures, particularly the WISC-R and WAIS-R, and healthy correlations (near .70) are the rule (e.g., Naglieri & Yazzie, 1983). As might be expected, correlations tend to be higher with Verbal IQ than Performance IQ.

In a very important and ingenious study, Maxwell and Wise (1984) investigated the vocabulary loading of the Peabody in a sample of 84 inpatients from psychiatry and psychology wards. Their study utilized the PPVT, but this earlier edition is similar to the PPVT-IV, so that the conclusions are pertinent here. The researchers investigated the hypothesis that the PPVT assesses more than vocabulary in adults. In addition to the PPVT, the researchers collected data on the following: WAIS-R, Wechsler Memory Scale, name-writing speed, and years of education. Name-writing speed is simply the number of seconds required for the examinee to write his or her full name. Even though all variables had significant correlations with PPVT IQ, WAIS-R Vocabulary had by far the strongest correlation (r = .88). More important, when the variance accounted for by Vocabulary was removed, none of the remaining variables had any predictive relationship with the PPVT. In short, the Peabody is a good measure of vocabulary (hearing vocabulary, in

particular) but could be misleading if used as a global measure of intellect. The PPVT-4 is a recent revision, so independent research with the test is limited. One caution with the previous edition, the PPVT-III, is that standard scores may be substantially lower than Wechsler IQs, particularly with persons with mental retardation and minority examinees. In a sample of 21 adults with mild mental retardation, Prout and Schwartz (1984) found the PPVT-R standard scores (mean of 56) to be an average of 9 points lower than the WAIS-R IQ (mean of 65). Naglieri and Yazzie (1983) found a huge 26-point difference with a sample of Navajo Indian children, who averaged a standard score of 61 on the PPVT-R in contrast to WISC-R IQ of 87. On a similar note, with the PPVT-III, Bell, Lassiter, Matthews, and Hutchinson (2001) found that the instrument tended to underestimate WAIS-III IQ scores of bright college students by about 10 points. Overall, we may conclude that the Peabody is a well-normed measure of hearing vocabulary that is useful with nonreading and motor-impaired

examinees. However, the instrument is not a substitute for a general intelligence test and PPVT-4 scores may underestimate intellectual functioning in some groups (e.g., minority children, high-functioning adults).

7.9 TESTING PERSONS WITH VISUAL IMPAIRMENTS

Many millions of American adults have some degree of visual impairment, including more than 1 million individuals who are legally blind, a term used in determining eligibility for government benefits. This term applies to individuals with central visual acuity of 20/200 or less in the better eye (with correction) or to those with significant reduction in their visual field to a diameter of 20 degrees or less (Bradley-Johnson & Ekstrom, 1998). The number of children with visual impairment is substantially smaller, with only 0.4 percent of students between the ages of 6 and 21 years receiving special education services because of a vision problem (U.S. Department of Education, 1992). In addition to special

arrangements in testing, individuals with visual impairment may require unique instruments for valid assessment. In assessing the intellectual functioning of the visually impaired, examiners have historically relied on adaptations of the Stanford-Binet. The Hayes-Binet revision for testing those with visual impairment was based on the 1916 Stanford-Binet; this instrument has since undergone several revisions. The most recent adaptation is the Perkins-Binet (Davis, 1980). The Perkins-Binet retains most of the verbal items from the Stanford-Binet but also adapts other items to a tactual mode. The Perkins-Binet possesses acceptable split-half reliability and shows high correlations with verbal scales of the WISC-R (Teare & Thompson, 1982). The developers of the Perkins-Binet have acknowledged that visual problems exist on a continuum by developing separate norms for children with usable vision (Form U) and no usable vision (Form N). Test developers have also succeeded in modifying the Wechsler Performance scales for

use with individuals with visual impairments. The Haptic Intelligence Scale for the Adult Blind (HISAB) consists of six subtests, four of which resemble the Digit Symbol, Block Design, Object Assembly, and Picture Completion tests of the WAIS Performance scale (Shurrager, 1961; Shurrager & Shurrager, 1964). The remaining two subtests consist of Bead Arithmetic, which involves the use of an abacus to solve arithmetic problems, and a Pattern Board, which requires the examinee to reproduce the pattern felt on a board that has rows of holes with pegs in them. The reliability of the HISAB is excellent and the authors provide normative data on a sample of adults with visual impairment. Most encouraging of all, HISAB scores correlate .65 with the WAIS Verbal IQ (Shurrager & Shurrager, 1964). Although the HISAB is still manufactured and sold by Stoelting Company, unfortunately, the test has never been investigated empirically. A search of PsycINFO for research with this instrument did not locate a single article.

Another interesting instrument is the Blind Learning Aptitude Test (BLAT), a tactile test for children from 6 to 16 years of age who are blind (Newland, 1971). The BLAT items are in bas-relief form, consisting of dots and lines similar to Braille. The items consist of six different types: recognition of differences, recognition of similarities, identification of progressions, identification of the missing element in a 2 × 2 matrix, completion of a figure, and identification of the missing element in a 3 × 3 matrix. Most of the items were adapted from Raven’s Progressive Matrices and the Cattell Culture Fair Intelligence Test. The BLAT was standardized on 961 functionally blind children 6 to 17½ years of age, in residential and day-care settings (Newland, 1990). The sample is said to be socioeconomically and racially representative of the U.S. population. The BLAT reveals excellent reliability, with internal consistency (Kuder-Richardson) of .93, and test–retest reliability over a 7-month period of .87 and .92 (two studies). The test correlates very well with the Hayes-Binet (r = .74) and the

WISC Verbal scale (r = .71). The BLAT also shows strong correlations with Braille oral reading speed and comprehension (Baker, Koenig, & Sowell, 1995). In conjunction with a verbal test, the BLAT is a promising instrument for testing the intelligence of children with visual disabilities. However, the test would profit substantially from minor revisions, updated norms, and a more thorough test manual. Dekker (1993) has developed a promising instrument for visually impaired children: the Intelligence Test for Visually Impaired Children (ITVIC). This test includes a number of haptic subtests (those relying only on the sense of touch), which are intended to replace traditional performance subtests like Block Design that require intact vision. Boter and Hoekstra-Vrolijk (1994) provide the compelling rationale for using haptic subtests with visually impaired children:

Although the necessity for an IQ test with haptic subtests for visually impaired children is evident in practice, the intelligence of visually impaired children is usually still measured only through the use of the verbal subtests of the WISC-R. The risk of this is that an incomplete and one-sided picture is obtained. Children with little education, with a disadvantaged background or missing a good command of the language may be underestimated. (p. 135)

Designed for children 6 to 15 years of age, the test has separate norms for partially sighted and totally blind examinees. The instrument includes five verbal subtests adapted from existing instruments such as the Wechsler scales and seven new nonverbal subtests that rely on tactile perception:

Verbal: Vocabulary, Digit Span, Verbal Fluency, Verbal Analogies, Learning Names
Nonverbal/Haptic: Perception of Objects, Perception of Figures, Block Design, Rectangle Puzzles, Map and Plan Tests, Exclusion of Figures, Figural Analogies

The full battery takes about three hours to administer. Currently, the test is published in Dutch, German, and English but has received limited use in the United States. This may be due, in part, to the size and weight of the test kit. The ITVIC comes in a large “hold-all” that cannot be easily carried from one location to another. Information about this specialized instrument can be found at www.bartimeus.nl.

7.10 TESTING INDIVIDUALS WHO ARE DEAF OR HARD OF HEARING

More than 1 million Americans are deaf or sufficiently hard of hearing that they rely on American Sign Language (ASL) as their primary means of communication (Brauer, Braden, Pollard, & Hardy-Braz, 1998). Given the typical limited mastery of the English language of persons who are deaf and, vice versa, the typical psychologist’s limited (or nonexistent) skill in ASL, the proper and valid

assessment of individuals who are deaf poses a profound cross-cultural challenge. More is involved than just picking a test developed for, and normed upon, individuals who are deaf or hard of hearing and who use sign language. One problem is that sign language “can now be characterized on a multidimensional continuum encompassing numerous styles, lexical variants, syntactic structures, dialects, and approximations to or departures from English word ordering” (Brauer et al., 1998, p. 299). Thus, a test developed in standard ASL is not equally fair to all persons who are deaf. In general, the proper and valid assessment of persons who are deaf requires that interested psychologists immerse themselves in the Deaf culture and also seek relevant educational and training experiences:

One especially needs a thorough understanding of the implications of deafness and the use of sign language for making diagnoses for people who are deaf. Few hearing psychologists have these skills. The push is for specialized training programs in deafness and psychology, a need that has been recognized for decades. (Brauer et al., 1998, p. 303)

If a consulting psychologist does not possess these skills, then the assessment of persons who are deaf should be referred to a person or agency with the requisite talents and expertise. The use of a sign language interpreter in the testing of persons who are deaf is a complicated and controversial matter. One concern is that the interpreter may inadvertently alter the content of the test, therefore affecting the validity of the findings. Certainly, it is unwise for parents or teachers to serve as interpreters. However, it is also true that persons who are deaf and who use sign language achieve higher IQs when the directions are signed than when they are delivered in the traditional manner (Braden, 1992). The preferred resolution is for the examiner to be fluent in sign language, so that any necessary translations stay within the bounds of standardized procedure. For the intellectual assessment of persons who are deaf or hard of hearing, the Wechsler

Performance subtests remain the tools of choice (Braden & Hannah, 1998). The impact of English language facility is minimized on these subtests, so it is thought that they provide a more accurate measure of cognitive skill than the Verbal subtests. Other tests sometimes used with persons who are deaf include Raven’s Progressive Matrices (Raven, Court, & Raven, 1992) and the Hiskey-Nebraska Test of Learning Aptitude, discussed previously. The WAIS-III is now available in a formal ASL translation (demonstrated on videotape), endorsed and disseminated by the test publisher (Kostrubala & Braden, 1998).

7.11 ASSESSMENT OF ADAPTIVE BEHAVIOR IN INTELLECTUAL DISABILITY

The term intellectual disability is the currently preferred designation for the disability historically referred to as mental retardation. In fact, the authoritative 130-year-old agency that has promoted the interests of affected

individuals, the American Association on Mental Retardation (AAMR), recently changed its name to the American Association on Intellectual and Developmental Disabilities (AAIDD). The latest edition of its authoritative manual (Schalock, Borthwick-Duffy, Buntinx, and others, 2010) eliminated all references to the term mental retardation. The reasons for the change have to do with providing a more hopeful and optimistic outlook for persons with intellectual disability:

The construct of intellectual disability belongs within the general construct of disability. Intellectual disability has evolved to emphasize an ecological perspective that focuses on the person-environment interaction and recognizes that the systematic application of individualized supports can enhance human functioning. (Schalock, Luckasson, Shogren, and others, 2007)

In contrast, the outdated concept of mental retardation gradually has taken on excess meanings that tend to isolate the problem within the individual rather than recognizing an ecological perspective. The assessment of intellectual disability is a complex and multifaceted concern that rightfully deserves a chapter or book of its own. Owing to space limitations, our coverage is necessarily abridged; interested readers are referred to Schalock et al. (2010) and Jackson, Mulick, and Rojahn (2007). Here we briefly summarize the diagnostic criteria for intellectual disability and then review several intriguing assessment instruments in modest detail. The most authoritative source for the definition of intellectual disability is the American Association on Intellectual and Developmental Disabilities. That organization defines intellectual disability as follows:

Intellectual disability is characterized by significant limitations both in intellectual functioning and in adaptive behavior as expressed in conceptual, social, and practical adaptive skills. This disability originates before age 18 (Schalock et al., 2007, p. 118).

The AAIDD further stipulates that significantly subaverage intellectual functioning is an IQ of 70 to 75 or below on scales with a mean of 100 and a standard deviation of 15. The agency explicitly affirms the importance of professional judgment in individual cases. A low IQ by itself is an insufficient foundation for the diagnosis of intellectual disability. As noted, the definition also specifies a second criterion: limitations in adaptive behavior as expressed in conceptual, social, and practical adaptive skills. A diagnosis of intellectual disability is warranted only when an individual displays a sufficiently low IQ and limitations in one or more of the broad areas of adaptive functioning. Furthermore, these deficits in intellect and adaptive functioning must have arisen during the developmental period, defined as between birth and the eighteenth birthday. Intellectual disability represents a continuum from very mild to substantially disabling. For this reason, previous terminology recognized four levels of disability: mild, moderate, severe,

and profound. However, current AAIDD designations represent a departure from this terminology. Instead of focusing on the shortcomings of the person, the manual introduces a hierarchy of “Intensities of Needed Supports,” which redirects attention to the rehabilitation needs of the client. The four levels of needed supports are intermittent, limited, extensive, and pervasive. However, the previous terminology referring to levels of disability will likely prevail for quite some time, so we have chosen to blend the old and the new approach in Table 7.8. The reader will notice a zone of uncertainty between levels of disability, which signifies that clinical judgment about all sources of information is required in diagnosis. Furthermore, even though these levels are calibrated by IQ ranges, we remind the reader that the examinee must also show corresponding deficits in adaptive skill. Under no circumstances is an IQ test a sufficient basis for diagnosing intellectual disability. Limitations in adaptive skill are more difficult to confirm than a low IQ. Fortunately, the

AAIDD stipulates specific skills within the three areas of adaptive functioning, namely:

• Conceptual skills: language and literacy; money, time, and number concepts; and self-direction.

• Social skills: interpersonal skills, social responsibility, self-esteem, gullibility, naïveté (i.e., wariness), social problem solving, and the ability to follow rules/obey laws and to avoid being victimized.

• Practical skills: activities of daily living (personal care), occupational skills, health care, travel/transportation, schedules/routines, safety, use of money, use of the telephone (www.aamr.org).

TABLE 7.8 Four Levels of Intellectual Disability

Mild Intellectual Disability: IQ of 50–55 to 70–75+, Intermittent Support required. Reasonable social and communication skills; with special education, attain sixth grade level by late teens; achieve social and vocational adequacy with special training and supervision; partial independence in living arrangements.

Moderate Intellectual Disability: IQ of 35–40 to 50–55, Limited Support required. Fair social and communication skills but little self-awareness; with extended special education, attain fourth grade level; function in a sheltered workshop but need supervision in living arrangements.

Severe Intellectual Disability: IQ of 20–25 to 35–40, Extensive Support required. Little or no communication skills; sensory and motor impairments; do not profit from academic training; trainable in basic health habits.

Profound Intellectual Disability: IQ below 20–25, Pervasive Support required. Minimal functioning; incapable of self-maintenance; need constant nursing care and supervision.

Source: Based on Schalock et al. (2010) and Beirne-Smith, Ittenbach, and Patton (2002).
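The IQ ranges in Table 7.8 overlap at their boundaries, and a low IQ alone never establishes the diagnosis. A minimal Python sketch of that logic follows. The function names (`iq_band`, `meets_screening_criteria`), the use of 75 as the upper screening threshold, and the handling of boundary scores are illustrative assumptions, not part of any AAIDD procedure, and the output is no substitute for clinical judgment.

```python
# Illustrative only: the bands mirror Table 7.8; function names and the
# handling of boundary scores are assumptions made for this sketch.

# Each entry: (zone_low, zone_high, level_below_zone, level_above_zone)
BOUNDARY_ZONES = [
    (20, 25, "profound", "severe"),
    (35, 40, "severe", "moderate"),
    (50, 55, "moderate", "mild"),
    (70, 75, "mild", "not in the intellectual disability range"),
]


def iq_band(iq):
    """Return the Table 7.8 level for an IQ, flagging zones of uncertainty."""
    for zone_low, zone_high, below, above in BOUNDARY_ZONES:
        if iq < zone_low:
            return below
        if iq <= zone_high:
            return f"{below} or {above} (zone of uncertainty)"
    return "not in the intellectual disability range"


def meets_screening_criteria(iq, adaptive_areas_limited, onset_age_years):
    """Check the three definitional criteria together.

    adaptive_areas_limited: number of broad adaptive areas (conceptual,
    social, practical) showing limitations, from 0 to 3.
    """
    low_iq = iq <= 75  # upper edge of the 70-75 zone (assumed cutoff)
    adaptive_deficit = adaptive_areas_limited >= 1
    developmental_onset = onset_age_years < 18
    return low_iq and adaptive_deficit and developmental_onset


if __name__ == "__main__":
    print(iq_band(62))
    print(iq_band(52))
    # A low IQ with no adaptive limitations does not satisfy the definition.
    print(meets_screening_criteria(62, 0, 5))
    print(meets_screening_criteria(62, 2, 5))
```

A score that falls inside a boundary zone is reported as belonging to either neighboring level, which reflects the point that clinical judgment about all sources of information decides such cases.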

In regard to the assessment of these limitations, the agency proposes that well-normed measures of adaptive skills are desirable, but the final determination is always a matter of clinical judgment. The first standardized instrument for assessing adaptive behavior was the Vineland Social Maturity Scale (Doll, 1935). Somewhat simplistic and coarse-grained by modern standards, the original Vineland scale consisted of 117 discrete items arranged in a year-scale format. An informant familiar with the examinee would check off applicable items. From these results the examiner would calculate an equivalent social age, helpful in the diagnosis of mental retardation. Still a respected instrument, the Vineland has undergone several

revisions and is now known as the Vineland Adaptive Behavior Scales, Second Edition (Sparrow, Cicchetti, & Balla, 2005). Since the release of the original Vineland scale, over 100 scales of adaptive behavior have been published (Matson, 2007; Reschly, Myers, & Hartel, 2002). These instruments vary greatly in structure, intended purpose, and targeted population. Broadly speaking, we can distinguish two types of instruments designed for two different purposes. One group of mainly norm-referenced scales is used largely to assist in diagnosis and classification. Another group of mainly criterion-referenced scales is used largely to assist in training and rehabilitation. We have chosen a few representative instruments for more detailed analysis.

Scales of Independent Behavior-Revised

The Scales of Independent Behavior-Revised (SIB-R; Bruininks, Woodcock, Weatherman, & Hill, 1996) is an ambitious, multidimensional measure of adaptive behavior that is highly useful in the assessment of intellectual disability. The instrument consists of 259 adaptive behavior items organized into 14 subscales. The scale is completed with the help of a parent, caregiver, or teacher well acquainted with the examinee’s daily behaviors. For each subscale, the examiner reads a series of items and for each item records a score from 0 (never or rarely does task) to 3 (does task very well). A useful feature of the SIB-R is that examiners need a minimum of training and experience. Of course, a much higher level of competence is required to evaluate results and make decisions about placement or treatment. The 14 subscales of the SIB-R are arranged into four clusters, as outlined in Table 7.9. In turn, these four clusters constitute the Broad Independence Scale. Each subscale consists of a small number of discrete, developmentally ordered items. For example, the subscale on Eating and Meal Preparation has 19 graded items, including spearing food with a fork, eating soup with a spoon, taking appropriate-sized portions, and preparing snacks that do not require cooking. For each subscale, items are

administered until a predetermined ceiling is reached (e.g., 3 of 5 consecutive items scored 0).

TABLE 7.9 The Subscales and Clusters of the Scales of Independent Behavior-Revised

1. Motor Skills
Gross Motor: 19 large muscle skills such as sitting without support or taking part in strenuous physical activities.
Fine Motor: 19 small muscle skills such as picking up small objects or assembling small objects.

2. Social and Communication Skills
Social Interaction: 18 skills requiring interaction with other people such as handing toys to others or making plans with friends to attend social activities.
Language Comprehension: 18 skills involving the understanding of spoken and written language such as looking toward a speaker or reading.
Language Expression: 20 tasks involving talking such as making sounds to get attention or explaining a written contract.

3. Personal Living Skills
Eating and Meal Preparation: 19 skills related to eating and meal preparation, ranging from drinking from a glass to planning a meal.
Toileting: 17 skills necessary to bathroom and toilet use.
Dressing: 18 skills related to dressing,

Raw scores for a subtest are added to obtain a part score. The part scores for each cluster are then added to obtain the cluster score. The score for the Broad Independence Scale is derived from the four cluster scores. The subtest scores, cluster scores, and the Broad Independence score can then be converted to a variety of normative scores to permit comparison of the examinee’s performance with the performance of the national norming sample. The normative scales include age scores, percentile ranks, standard scores, stanines, and normal curve equivalents. A separate, unique part of the SIB-R also assesses maladaptive behavior by measuring the frequency and severity of problem behaviors. The Problem Behaviors Scale includes eight major categories of personal and social maladjustment that could affect adaptive behavior: Hurtful to Self, Hurtful to Others, Destructive to Property, Disruptive Behavior, Unusual or Repetitive Habits, Socially Offensive Behavior, Withdrawal or Inattentive Behavior, and Uncooperative Behavior.

Examples of problem behaviors are listed, and the respondent must indicate the behaviors displayed by the examinee. In addition, the respondent describes the one most serious behavior in each category and rates it according to frequency of occurrence, severity, and typical management. The standardization of the SIB-R was well conceived and executed. The norm group consisted of 2,182 persons sampled to reflect the 1990 census characteristics. The normative data cover persons from age 3 months to adults over age 80. An additional sample of persons with mental retardation, learning or hearing disabilities, and behavior disorders was also tested. The value of the SIB-R was further strengthened by anchoring it to the norms for the Woodcock-Johnson Psycho-Educational Battery-Revised. The SIB-R is one component of this larger test battery, but can be used on its own. The reliability of the SIB-R is generally respectable, but somewhat variable from subscale to subscale and from one age group to

another. The individual subscales tend to show split-half reliabilities in the vicinity of .80; the four clusters have median composite reliabilities around .90; the Broad Independence Scale has a very robust reliability in the high .90s (Bruininks, Woodcock, Weatherman, & Hill, 1996). Validity data for the SIB-R are very promising. For example, the mean scores of various samples of disabled and nondisabled subjects show confirmatory relationships: SIB-R scores are lowest among those persons known to be most severely impaired in learning and adjustment. For disabled examinees, SIB-R scores correlate very strongly with intelligence scores (in the .80s), whereas with nondisabled examinees, the relationship is minimal (Bruininks et al., 1996). The SIB-R also possesses excellent convergent validity—the Broad Independence Score correlated .83 with the composite score from a similar instrument, the Vineland Adaptive Behavior Scales (Middleton, Keene, & Brown, 1990). Tan, Hultsch, Hunter, and Strauss (2010) reported

that a slightly modified version of the SIB-R was helpful in the evaluation of elderly clients with mild cognitive impairment. In sum, the SIB-R is an excellent tool for providing insights into an examinee’s current level of functioning in real-life situations in the home, school, and community settings. Although this instrument does not have a precise correspondence with the areas of adaptive skill listed in the definition of intellectual disability, there is substantial similarity. For example, the following areas of adaptive skills are well covered by subscales or clusters of the SIB-R: communication, self-care, home living, social skills, community use, health and safety, and work. The SIB-R or a similar instrument ranks as a mandatory supplement to individual intelligence testing in the diagnosis and assessment of mental retardation.

Inventory for Client and Agency Planning (ICAP)

The Inventory for Client and Agency Planning (Hill, 2005) is one of the most widely used tests in the field of developmental disabilities. This test is suitable for children and adults with mental retardation, individuals who become disabled as adults through illness or accident, and elderly persons who have slowly lost their independence and, therefore, need special assistance. The focus of the instrument is on determining the need for special services such as personal care, remedial education, vocational training, or sheltered work environment. The test is a 16-page booklet that evaluates adaptive behavior, maladaptive behavior, and the need for assistance and supports. Amazingly, it can be completed in about 15 minutes by a parent, teacher, or caregiver who is well acquainted with the client. The scales and subscales of the ICAP are depicted in Table 7.10. Identical to the SIB-R, adaptive behaviors are rated on a scale from 0 to 3, with 0 indicating never or rarely does a behavior well (even if asked), 1 indicating does the task but not well, 2 indicating does the task fairly well,

and 3 indicating does the task well without being asked. The maladaptive behaviors are assessed in a more complex manner using open-ended questions and follow-up queries as to frequency, severity, and consequences of the maladaptive behaviors. This technique provides for a maladaptive behavior subscale with enhanced reliability (r = .80) in comparison to similar subscales from other instruments that reveal low reliability (r = .60). From a psychometric standpoint, the ICAP meets the highest standards.

TABLE 7.10 Scales and Subscales of the Inventory for Client and Agency Planning

Scale | Number of Items | Subscales or Domains Measured
Descriptive | 10 | Data on age, height, weight, legal status
Primary and Additional Diagnoses | 14 | All relevant medical and psychological diagnoses
Special Needs | 10 | Special needs in vision, hearing, mobility, health care, medications
Residential Supports | 2 | Residential supports now and in future
School/Vocational Supports | 2 | School and vocational supports now and in future
Other Support Services | 26 | Survey of all support services needed, now and in future
Social/Leisure Activities | 16 | Survey of social and leisure activities
Adaptive Behavior | 77 | Level of functioning in motor skills, social and communication skills, personal living skills, and community living skills
Maladaptive Behavior | 24 | Self-injury, stereotyped, withdrawn, offensive, uncooperative, disruptive, destructive, hurts others

Note: The ICAP also yields a Service Score based on Adaptive Behavior and Maladaptive Behavior.

One of the most useful and appealing aspects of the ICAP is that it provides an overall Service Score based on both adaptive and maladaptive behavior. The Service Score, which ranges from 0 to 100, indicates the likely level of attention, supervision, and training needed by the client. The lower the score, the greater the need for oversight. For example, a child with severe disabilities and many maladaptive behaviors might earn a score of 5, indicating the need for intensive supervision virtually 24 hours a day. At the other extreme, a normal young adult with no behavior problems might earn a score of 95, indicating almost complete self-sufficiency.

By intention, the Service Score was designed to predict not only the service intensity needed but also the costs associated with delivering the assistance. For this reason, state and regional users often collate their ICAP data in a computer database provided by the test publishers. In many states in the United States, the human services departments have linked their disability services with results from the ICAP. For example, in Colorado, the ICAP is used by the Division of Services for People with Disabilities to determine eligibility and to allocate funds for individuals receiving residential services and day care services (www.cdhs.state.co.us). Resources are allocated for other reasons as well, but the ICAP is foundational to the entire system of disabilities services. Certainly, this is an example of consequential testing: The fate of an entire group of individuals is linked to the soundness of the ICAP for purposes of determining services.

Additional Measures of Adaptive Behavior

We remind the reader that measures of adaptive behavior vary greatly. Some scales are designed mainly for diagnosis, others for remediation. Some scales are useful with persons with severe and profound mental retardation who will never be employed, others with individuals with mild mental retardation seeking vocational training. Some scales are useful exclusively with children, others with adults. These instruments are not interchangeable, and the potential user must study their strengths and limitations carefully.

The Vineland Adaptive Behavior Scales-II (VABS-II, Sparrow, Cicchetti, & Balla, 2005) is the most widely used measure of adaptive behavior in existence. The instrument is the outcome of a major revision and restandardization of the Vineland Social Maturity Scale, originally published in 1935 by Edgar A. Doll. Based on a semistructured interview with a caregiver or parent, the VABS

provides an evaluation in the following domains and subdomains: Communication (receptive, expressive, written), Daily Living Skills (personal, domestic, community), Socialization (interpersonal relationships, play and leisure time, coping skills), Motor Skills (gross, fine). The VABS-II is a widely respected instrument with good concurrent validity, including correlations in the range of .50 to .80 with the Wechsler scales and Stanford-Binet. However, some of the interview items require knowledge that the informants may not possess (e.g., whether a child says 100 recognizable words). Silverstein (1986) faults the normative data, noting discontinuous jumps in standard scores from one age group to another. Even so, the Vineland continues to be a highly popular test in clinical practice and research. A promising development in research is the increasing use of this instrument in other countries. For example, de Bildt, Kraijer, Sytema, and Minderaa (2005) report favorably on the validity of the VABS in a sample of 826 Dutch children with mental retardation, and Balboni, Pedrabissi, Molteni,

and Villa (2001) established that the instrument accurately identifies mentally retarded individuals with and without communication impairment, social behavior problems, and motor disabilities.

The American Association on Intellectual and Developmental Disabilities (AAIDD) has developed several scales useful in the assessment of persons with cognitive limitations. We mention here just one of its products, the AAMR Adaptive Behavior Scales: Second Edition (Nihira, Leland, & Lambert, 1993). The residential and community version of this test, suitable for persons 18 to 80 years of age, is a psychometric tour de force that borders on overkill. The normative sample includes more than 4,000 persons with developmental disabilities from 43 states residing in the community or in residential settings. In addition to assessing the appropriate behavioral domains (e.g., independent functioning, domestic activity, self-direction, responsibility), a noteworthy feature of the instrument is the

careful attention to maladaptive behaviors, which are evaluated in eight domains:

• Violent and antisocial behavior
• Rebellious behavior
• Eccentric and self-abusive behavior
• Untrustworthy behavior
• Withdrawal
• Stereotyped and hyperactive behavior
• Inappropriate body exposure
• Disturbed behavior

This scale has been extensively validated and clearly distinguishes persons independently classified at different adaptive behavior levels.

7.12 ASSESSMENT OF AUTISM SPECTRUM DISORDERS

Autism is not a single disorder, but a range of closely related disorders evident in the first years of life. Autism spectrum disorders (ASDs) include diagnostic categories such as autistic disorder, Asperger’s syndrome, childhood disintegrative disorder, and pervasive developmental disorder, among others (American Psychiatric Association, 2000). Although the level of disability and specific symptoms vary from child to child, what all children with ASDs share in common is a core of difficulties with reciprocal social skills, communication abilities, and flexible behavior. Often, empathy is absent. Affected children may display stereotypic activities, interests, and behaviors. A characteristic vignette of a child with ASD might read as follows:

Martin is a cute 2-year-old boy who is perplexing and worrisome to his parents. He will only eat crunchy foods and refuses to use utensils. He rarely makes eye contact. When watching TV, he rocks back and forth and flaps his hands. He seldom speaks, although he does verbalize “music” when he wants to hear a favorite CD of children’s songs. He becomes enraged if his parents play a different CD. He appears self-absorbed and does not respond affectionately to his parents. For Martin, taking turns is a foreign concept. He has a very short attention span. Even so, bright metal objects fascinate him.

According to the Centers for Disease Control and Prevention, about 1 in 88 children manifests an ASD, and these disorders are 5 times more common among boys than girls (Morbidity and Mortality Weekly Report, March 30, 2012). Early diagnosis and intervention are vital because of the improved prognosis (Hollander, Kolevzon, & Coyle, 2011). The assessment of children for ASDs is a complex endeavor that includes screening tests, behavioral observations, and diagnostic evaluation by specialists in pediatrics, neurology, and psychology. Excessive reliance on checklists or tests is unwise. Even so, appropriate scales can be a useful starting point. We survey a few good measures here. The Modified Checklist for Autism in Toddlers (M-CHAT; Robins, Fein, & Barton, 1999) is an appealing 23-item checklist that enjoys strong content validity. The M-CHAT is a screening test used with toddlers between 16 and 30

months of age to identify children at risk for ASDs. The authors openly acknowledge that the instrument yields a high false-positive rate. Thus, M-CHAT should be used only in conjunction with further diagnostic evaluation, in the event of a “failing” score. Items on the checklist resemble the following:

Does your child play with other children? Yes / No
Does your child smile when you smile? Yes / No
Does your child engage in pretend play? Yes / No
Does your child enjoy peek-a-boo? Yes / No
Does your child respond to his/her name? Yes / No
Does your child sustain eye contact? Yes / No

Children who fail three or more items (or two or more critical items) should be referred for further evaluation by specialists. The M-CHAT has been translated into more than 30 languages.
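The M-CHAT referral rule (three or more failed items, or two or more failed critical items) can be sketched in a few lines of Python. This is a minimal illustration, not the published scoring key: the function name `needs_referral` and the item numbers in the example, including the set of critical items, are hypothetical placeholders. The sketch also assumes that each parent response has already been scored as pass or fail.

```python
# Illustrative sketch of the referral rule: refer when three or more items
# are failed, or when two or more critical items are failed.
# Item numbers and the critical-item set below are hypothetical placeholders.

def needs_referral(failed_items, critical_items):
    """Return True if the failed items meet either referral threshold."""
    failed = set(failed_items)
    critical_failed = failed & set(critical_items)
    return len(failed) >= 3 or len(critical_failed) >= 2


if __name__ == "__main__":
    hypothetical_critical = {5, 10, 15}  # placeholder, not the published key
    # Two failed items, neither critical: no referral.
    print(needs_referral({1, 4}, hypothetical_critical))
    # Two failed items, both critical: referral.
    print(needs_referral({5, 10}, hypothetical_critical))
    # Three failed items, none critical: referral.
    print(needs_referral({1, 4, 6}, hypothetical_critical))
```

Either threshold alone is enough to trigger a referral, which is consistent with a screening test that accepts a high false-positive rate in exchange for missing fewer at-risk children.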

Robins (2008) reported a large-scale study of 4,797 children evaluated with M-CHAT during toddler checkups. From this sample, 466 screened positive on the M-CHAT, including 362 families who completed a follow-up interview. From this group, 21 children eventually were diagnosed with ASDs. Remarkably, only four of these 21 children were flagged by their pediatrician. In sum, the M-CHAT yields a high false-positive rate, but this is an acceptable price to pay for identifying at-risk children who might otherwise go undetected for additional months or years. In fact, the “cost” of the false-positive identifications usually consisted of a telephone follow-up call or brief in-person interview to determine that further assessment was not warranted.

Another widely used autism checklist is the Baby and Infant Screen for Children with Autism Traits-Part 1, referred to as BISCUIT-Part 1 by the authors (Matson, Boisjoli, & Wilkins, 2007). The instrument consists of 71 items that assess the core symptoms of autism in toddlers 17 to 37 months of age. The items are completed by a parent or caretaker on a 3-point scale that includes 0 (not different, no impairment), 1 (somewhat different, mild impairment), and 2 (very different, severe impairment). Items are brief and resemble the following: communicates verbally, takes turns, sustains eye contact, responds to name. An exploratory factor analysis of results for 1,287 children enrolled in an early intervention program yielded a three-factor solution consistent with symptom clusters found in ASD children, supporting the construct validity of the scale (Matson, Boisjoli, Hess, & Wilkins, 2010). The BISCUIT-Part 1 also demonstrated good convergent validity with the M-CHAT, and appropriate divergent validity with measures of adaptive and motor behaviors in a sample of 1,007 toddlers (Matson, Wilkins, & Fodstad, 2011). Over 80 studies have been published on the scale. For a recent review, see Matson and Tureck (2012).