Large Scale Assessments
CHAPTER NINE
Testing young language learners through large-scale tests
Introduction
In this chapter we will look at the development and use of large-scale tests for young learners. Large-scale tests are externally developed tests that are administered to many learners, usually across school districts or school systems. They are usually standardized, that is, they follow a consistent set of procedures for designing, administering and scoring. Standardization is required when scores will be used to compare individuals or groups. If children take the same test under the same conditions, then the scores on the tests are believed to have the same ‘meaning’ and are therefore comparable.
Large-scale standardized tests are employed for many reasons. They can save time and money as resources are pooled in one place, and efficiencies are maximized through shared development and administration processes. Paper-and-pencil tests, often used in large-scale standardized tests, are easy to administer and score, and are therefore less costly. Standardized tests have credibility when they are developed through research techniques. Facts and figures are impressive and can be reported to the public as evidence of effectiveness. They also have anonymity; schools and teachers do not need to be the bearers of bad news to parents and children but can refer to the impartial third party, the test, to convey a judgment. Lastly, standardized tests have comparability. With standardized test data, one school can be compared with others, locally, regionally and nationally (Jalongo, 2000). Comparable data from large-scale standardized tests provides administrators with what they believe is non-refutable evidence for accountability purposes when other, perhaps more qualitative, evidence is, in their opinion, not acceptable or trusted.
Downloaded from https://www.cambridge.org/core. Wayne State Univ Libraries, on 17 Oct 2020 at 23:17:03, subject to the Cambridge Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9780511733093.010
Many young second language learners are assessed through large-scale testing. They may be assessed on language-specific tests or on content-area tests designed for the majority. In the USA, ESL learners, or English Language Learners (ELLs), have, for many years, been tested on content-area tests designed for all students. Recently, yearly monitoring of English language proficiency of these learners has been legislated.
However, advocates of alternative assessment techniques discourage the use of large-scale testing with young learners. Nevertheless, testing of young language learners is taking place around the world. I therefore begin with an outline of the main criticisms of the large-scale testing of young learners. The main steps in test development that are outlined in this chapter, following Bachman and Palmer (1996), apply to both large-scale and smaller-scale assessment. However, a full and thorough application of the many test development steps is most likely to occur when a high level of resources, including research facilities, is available. The Cambridge Young Learners English Tests provide the basis in this chapter for an illustration of a large-scale foreign language test with full resources to conduct the many steps in standardized test development.
In the second major section of this chapter, I examine approaches to the large-scale testing of the English language of second language learners and emphasize the need to ensure that these tests focus on academic language proficiency, that is, the language of school. Alternatives to standardized testing are advocated by many educators, even in large-scale testing situations, and these are reviewed in this chapter.
How appropriate is large-scale standardized testing for young language learners?
One of the main criticisms of large-scale standardized testing is pedagogical: that such tests do not usually provide immediate feedback for test takers to improve their learning, or to teachers to improve their teaching. They are designed primarily for administrative purposes; that is, to monitor children’s learning to inform teachers and parents about children’s relative progress (Is this child slipping behind?), and to inform the central authorities about progress patterns for policy and administrative reasons (Is this school meeting its targets?). Indeed, Shohamy (2001) has pointed out how there has been a shift in recent years in the primary role of tests from measuring knowledge to enabling centralized bodies to control the content of education, and learning and teaching. Teachers argue that they prefer classroom performance-based assessment because they can gather information about performance and progress, give immediate feedback and adjust their teaching accordingly, thus immediately influencing the learning process.
Opponents of large-scale tests suggest that a test is a sample of one type of behaviour in one type of context (a testing context) and since children perform in different ways in different contexts, a child’s behaviour in different settings can be radically different. The artificiality of a testing situation will influence what kind of a sample is collected, and it is therefore best to assess language use ability in a natural setting, rather than relying on the contrived situations required by standardized tests.
Some educators argue, too, that there is little value in comparing the progress of young learners with others at the early stages of their learning. Young learners mature at different rates and their individual progress is best monitored with reference to broadly expected developmental norms, and with expectation of difference, not in direct comparison to others. Loss of self-esteem and confidence can result from negative feedback in these vulnerable early years. Many argue it is the teacher who can give the best feedback to the child, and the most informed and detailed reports to parents.
Some educators argue that not only can large-scale standardized tests be of little direct use in the pedagogy of the classroom; in some situations they can have a negative impact on teaching and learning. In high-stakes testing situations, schools and teachers can become nervous about test results (schools’ future funding or teachers’ careers may depend on good results) and may systematically teach to the test. This inevitably leads to a narrowing of the curriculum.
The issue here is not whether children can learn to be better test takers. Of course they can. The question is: What has been sacrificed in the process? If the test begins to dictate the curriculum and severely restrict teachers’ innovative teaching practices, then the test – a mere sample of behavior – has become far more influential than it should be.
(Jalongo, 2000, p. 294)
Most large-scale standardized tests, for reasons of practicality, have a paper-and-pencil test format, that is, learners are asked to complete various tasks and items on paper. Preparing children to take a paper-and-pencil test is likely to be at the expense of real experiences with the language. Similarly, a large-scale standardized test might be a test of language knowledge, tested through discrete-point items such as grammar or vocabulary items, rather than language use tasks. Discrete-point items do not reflect the construct that should be assessed in language teaching, that is, language use. As I have argued throughout this book, young learners need a breadth of experiences with the target language. They need to combine the use of the language with physical encounters with their environment, and they need opportunities to encounter the language in a range of social situations.
There is no doubt that the characteristics of young learners cause them to have some difficulties with large-scale standardized tests. Young learners are often unfamiliar with test procedures and may not realize the importance of the test. While adults will address themselves to a test, knowing the way tests work, and how important they are, children may not be able to give the test the attention it needs to show what they can or cannot do. The length of tests can put children at a disadvantage; children are used to short, concentrated, often physical tasks and find it difficult to concentrate for very long. If the test is only short to accommodate this, the lack of breadth of sampling is likely to affect the score: for example, a child’s reading proficiency may be assessed on just one or two items. Response formats (e.g., circle the correct answer) may confuse children, especially when they are in an isolated situation without adult help that could quickly put them right. Children may not give an answer to an item; this may not necessarily be because they don’t know the answer, but because they are not sure of what to do or are afraid of making a mistake. These are some of the characteristics of young learners that need to be taken into account in testing situations.
Yet, others argue that large-scale standardized tests play an important role in education, and that such tests, developed appropriately, can be of value in the education of young learners. Standardized tests can be attached to the mandated or published curriculum, thus reinforcing the teaching of that curriculum, and perhaps the expected progress rate through that curriculum. Data from standardized tests help administrators to find out where additional resources are needed to improve learning, or where programmes should be extended or curtailed. Current managerialist ideologies in education require administrators to have reliable information about achievement in the range of schools and classrooms under their jurisdiction. With data from standardized testing, administrators are able to ascertain accountability, and to provide rewards and sanctions accordingly. Parents are more aware of how their children are progressing in relation to other children and can make decisions about where to send their children based on the published achievements of their schools on tests. For all these reasons, standardized tests are likely to stay with us.
Ultimately, decisions on whether a standardized test for young language learners goes ahead must depend on the nature of the impact of the test on those young learners involved. Impact is one of the factors to be considered when a test’s ‘usefulness’ is evaluated. Systematic procedures should be followed to maximize the test’s ‘usefulness’ if it does go ahead. The procedures described below give guidance on how to develop both large-scale and smaller-scale tests, and how to ensure that they will be ‘useful’ for young learners.
Processes of test development (for both large- and small-scale testing)
This section provides a brief overview of test development processes. The next section illustrates these test development processes through a case study description of a large-scale test for young learners, the Cambridge Young Learners English Tests. Full details of test development processes can be found in specialist testing volumes; several writers have outlined these processes in detail (for example, Alderson, Clapham, and Wall, 1995; Bachman and Palmer, 1996; Davison and Lynch, 2002; Weigle, 2002). Test development processes should be followed in all test development, whether for large-scale tests or smaller-scale tests in a school or classroom. In small-scale situations there are likely to be limitations on resources which will influence the degree to which each step can be followed in depth, though all steps should be followed if a ‘useful’ test is to be devised.
Bachman and Palmer (1996) identify three phases in test development: a design phase, an operationalization phase and an administration phase. These phases are likely to be followed in a linear fashion but may be followed in an iterative way, when developers return to previous steps as they find out something new, or as they realize they need to rectify a problem. The phases are summarized below. Figure 9.1 shows an overview version of Bachman and Palmer’s model of test development.
Considerations of the qualities of usefulness

Design phase
• Describing purpose, TLU (target language use) domain and task types, and characteristics of test takers
• Defining construct(s)
• Developing a plan for evaluating the qualities of usefulness
• Checking available resources and planning their allocation and management

Operationalization phase
• Preparing test task specifications and a blueprint
• Writing instructions
• Specifying the scoring method

Administration phase
• Try out
• Operational test use:
  • Procedures for administering tests and collecting feedback
  • Procedures for analysing test scores
  • Archiving

Figure 9.1 Model of test development (adapted from Bachman and Palmer, 1996, p. 86).

In the design phase, test developers gather information that will provide the foundation of information and theory required for the next phases of test development. Test developers establish the test purpose by communicating with those requesting the test, for example teachers, schools or education systems. These requirements form the mandate for the test (Davison and Lynch, 2002). The task types in the TLU domain (that is, the situations and tasks that learners use and need in their actual language use context), and the characteristics of the test takers are described in detail; these provide a basis for developing test tasks and for checking the ‘usefulness’ of the test. To do this, test developers draw on direct knowledge of the children and their curriculum or learning context, or they find out information from informants (usually teachers). In the design phase, test developers also define the constructs that are to be assessed. They do this by referring to the curriculum specifications, to a theory of language or to both. The constructs need to be defined clearly, as these provide the basis for the development of test tasks and also provide the basis for considering and investigating the construct validity. The design phase also involves the development of a plan for evaluating the qualities of ‘usefulness’ of the test. Also in the design phase, developers need to check the resources (human, material and time) that will be available for the test development process, and to plan how they will be allocated and managed.
In the operationalization phase test developers are concerned with preparing test task specifications and a blueprint. Test task specifications will describe the types of test tasks that will be included in the test; the relevant task characteristics for each task in the test will be described in detail as part of this process. (For an example of test task specifications, see Table 9.5 later in this chapter.) A blueprint for the test consists of information about the structure, or overall organization of the test, as well as test task specifications for each task type to be included in the test. Writing instructions are prepared in this phase; these instructions give information to test writers about the structure of the test, the nature of the tasks and how learners are expected to respond. Also in this phase, the scoring methods are specified: firstly the criteria are defined, and then the procedures that will be followed to arrive at a score are determined.
The test administration phase involves administering the test, collecting information and analysing this information. The plan for evaluating the qualities of ‘usefulness’ determined in the design phase is put into operation in this phase when developers analyse the relevant qualitative and quantitative information that comes to them through the administration of the test. The test administration phase typically has two stages: a try-out and operational test use. A try-out, which is like a pilot phase, gives developers information about whether aspects of the test need to be altered. A try-out can be done with small groups or larger groups, and is highly recommended even in classroom testing. It is also highly recommended that try-outs be iterative. Ideally, before full tests are assembled for piloting, individual tasks or sets of items should be tried out and revised if necessary. In the operational test use stage, the test is administered as a test. The testing environment is prepared; test materials are collected, examiners trained and the test is given to the intended test takers. Performances are scored and results analysed, and at this time further information can be collected about test ‘usefulness’. Procedures for these processes are prepared. Test scores might be analysed, for example to determine the quality of a test item, or the reliability of a test score, using statistical procedures. Also in the test administration phase, archiving procedures are determined and carried out, so that tasks and other information can be easily retrieved as needed.
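As an illustration of the kind of statistical procedure mentioned above, the following minimal sketch computes two common indices from try-out data: item facility (the proportion of test takers answering each item correctly) as one indicator of item quality, and Cronbach’s alpha as one estimate of score reliability. These particular indices are an assumption chosen for illustration; the chapter does not say which procedures are used. The data and the function names (item_facility, cronbach_alpha) are hypothetical.

```python
from statistics import pvariance


def item_facility(responses):
    """Return, for each item, the proportion of test takers who answered it correctly."""
    n_takers = len(responses)
    n_items = len(responses[0])
    return [sum(row[i] for row in responses) / n_takers for i in range(n_items)]


def cronbach_alpha(responses):
    """Estimate internal-consistency reliability (Cronbach's alpha) for scored items."""
    n_items = len(responses[0])
    if n_items < 2:
        raise ValueError("alpha needs at least two items")
    # Variance of each item's scores across test takers.
    item_variances = [pvariance([row[i] for row in responses]) for i in range(n_items)]
    # Variance of the total scores across test takers.
    total_variance = pvariance([sum(row) for row in responses])
    if total_variance == 0:
        raise ValueError("all test takers have the same total score")
    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical try-out data: 5 children x 5 items, 1 = correct, 0 = incorrect.
tryout = [
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 1, 1, 1, 0],
    [1, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
]

facility = item_facility(tryout)
alpha = cronbach_alpha(tryout)
print("Item facility:", facility)
print("Cronbach's alpha:", round(alpha, 2))
```

An item that every child answers correctly (or incorrectly) tells developers little about differences between test takers, so a try-out analysis of this kind can flag items for revision before operational use. With very small groups, such as a single class, these figures are only rough indications.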
Considerations of qualities of ‘usefulness’ run throughout the above three test development phases: design, operationalization and administration. As outlined in Chapter 5, there should be an appropriate balance amongst the six qualities of ‘usefulness’: ‘This is done by determining minimum acceptable levels for each and recognizing that what constitutes an appropriate balance and appropriate minimum acceptable levels will vary from one testing situation to another’ (Bachman and Palmer, 1996). Plans to evaluate ‘usefulness’ are drawn up in the design phase. Plans will stipulate data collection to be carried out quantitatively, for example through the collection and analysis of test scores, and qualitatively, for example through the collection and analysis of observers’ descriptions, and verbal self-reports from students. Data collection and analysis will mostly take place in the test administration phase. Questions will be asked through all development phases to carry out a logical evaluation of each quality. For example, ‘Is the language ability construct for this test clearly and unambiguously defined?’ See Bachman and Palmer (1996, p. 149) for a checklist of questions for evaluating the qualities of ‘usefulness’.
Illustrating the test development process: The Cambridge Young Learners English Tests
The Cambridge Young Learners English Tests is a set of large-scale tests developed and administered by the University of Cambridge ESOL (English for Speakers of Other Languages) Examinations. At the end of 2001 the worldwide candidature had reached nearly 200,000, with large numbers of candidates in countries such as China, Spain, Argentina and Italy. Preliminary work on the Cambridge Young Learners test took five years. Work commenced in 1993, and the test was first taken by candidates in 1997. The following description of, and details of the development of, the Cambridge Young Learners Test are drawn from Cambridge ESOL publicity materials, research notes and sample tests, published articles and interviews with Cambridge ESOL personnel (Taylor and Saville, 2002; University of Cambridge ESOL Examinations, 2003). The three phases of test development (test design, operationalization and administration) provide the framework for the presentation of information.
The test design phase
Describing the purpose: The purpose of the Cambridge Young Learners English Tests is described in test publications. The overall purpose is to assess the English language ability and progress of young English as a foreign language learners, in many countries around the world. The stated aims of the Cambridge Young Learners English Tests are to:
• sample relevant and meaningful language use
• promote effective learning and teaching
• encourage future learning and teaching of English
• measure accurately and fairly
(University of Cambridge ESOL Examinations, 2003, p. 6)
The three tests together build a bridge to take young learners of English from beginner to the beginning level of the Key English Test (KET) in the Cambridge Main Suite exams.
Describing the TLU domain and task types and the characteristics of the learners, and defining the construct: The TLU domain and task types for young foreign language learners were established in the design phase for the test. Researchers conducted a close examination of curriculum design and pedagogy for young learners, recent textbooks and other resource material (e.g., a CD ROM, developed by Homerton College, Cambridge University’s teacher training college). Textbooks and teaching materials in classrooms around the world were reviewed, in order that the main content areas (topics, vocabulary, etc.) which frequently occur in young learner programmes were reflected. Theories of children’s foreign language learning (the way they learn, but also the pathways they take) were also examined. The presentation of these materials was also reviewed, in order to inform the way that tests might be presented. There is now about an 80% overlap with textbooks in the test syllabus. Item banks, that is, sets of items which have been developed and trialled, were built up, and this process continues. These item banks are drawn upon in the development of individual tests.
The test takers for the Cambridge Young Learners English Tests are children learning English as a foreign language around the world, ranging from ages 7 to 12 (with some children accepted at 13 years of age if they are learning alongside 12-year-olds).
Cambridge ESOL has paid particular attention to the educational consequences of using a language test with young learners, and the following areas were carefully considered:
• current approaches to curriculum design and pedagogy for young learners, including recent textbooks and other resource materials (e.g., CD ROM);
• children’s cognitive and first language development;
• the potential influence of test methods, including the familiarity and appropriacy of different task types, item formats, typography and layout;
• probable variation between different first language groups and cultures.
(University of Cambridge ESOL Examinations, 2003, p. 5)
Taylor and Saville (2002) describe how extensive literature reviews were undertaken into children’s socio-psychological and cognitive development, the second language teaching and learning of young learners, and issues in second language assessment of young learners. This helped the developers to form close knowledge of the children that would be taking the test, and to identify the nature of the constructs to be assessed. The handbook states that the test samples ‘relevant and meaningful language use’. The four macro-skills (listening, speaking, reading and writing) are covered, though there is emphasis on oral/aural skills ‘because of the primacy of spoken language over written language among children’ (Taylor and Saville, 2002). Writing is largely at the word/phrase (enabling skills) level ‘since young children have generally not yet developed the imaginative and organizational skills needed to produce extended writing’ (Taylor and Saville, 2002).
A syllabus was subsequently devised to indicate the nature of the construct and the range of knowledge and skills to be tested. These are described through lists of topics, notions and concepts, structures and vocabulary for each level. As part of the structure list, language use (communication) items are itemized, connecting to the language (grammar and structures) and language items (examples). The syllabus for each level is provided in a free handbook, and is also available on the Internet. Table 9.1 shows an extract from the structure list for the first level, Starters. Other lists of topics, notions and concepts for each level are also provided, along with an alphabetic vocabulary list. Thus, the construct for assessment is clearly established, and also openly available to all stakeholders.
Table 9.1 Extract from Starters Structure List (University of Cambridge ESOL Examinations, 2003, p. 13)

| LANGUAGE (GRAMMAR AND STRUCTURES) | LANGUAGE USE (COMMUNICATION) | LANGUAGE ITEMS (EXAMPLES) |
| THE ALPHABET | Writing down spelling | That’s W-H-I-T-E. |
| NOUNS: Singular and plural, including limited, specified, irregular plural forms (Proper nouns) (Common nouns) | Asking who people are and identifying people; Responding to requests for information about objects | Are you Bill? It’s Pat. They’re oranges, not lemons. |
| NOUNS: Possessive forms: /’s /s’ | Talking about ownership | That’s Ann’s bike. |
| ADJECTIVES: size, age, colour | Describing and identifying objects, people and animals | He’s a small boy. Your face is very dirty. |
| | Identifying colours | It’s a red car. |
| DETERMINERS: a, an, the, some | Identifying objects, animals, fruit, vegetables, etc. (with countables and uncountables) | It’s a banana. Who’s eating an egg? Put the tomato on the table. He’s got some apples. |
| my, your, his, her, our, their | Talking about possessions and relationships | It’s my brother’s birthday. |

(list continues)
The constructs, that is, the expected language use and language knowledge as identified above, become more complex in the specifications from Starters to Movers to Flyers, the three levels of proficiency in the test.
Developing a plan for evaluating the qualities of ‘usefulness’: Plans were put into place to check that the test would have the qualities of ‘usefulness’ or their theoretical equivalent. These included plans for ongoing qualitative and quantitative data collection and analysis in the test administration phase, such as surveys of test users (centres, teachers, examiners) in different markets to ask for feedback on various aspects of the tests; analysis of test papers to ascertain the skills focus in individual tasks and balance across papers; and trialling of task types with different learner groups.
Checking available resources and planning their allocation and management: Plans for the allocation and management of resources were put in place in the early stages. Resources were needed to cover the costs of test development, the planned research to establish ‘usefulness’, publicity materials for the test, handbooks describing the specifications of the test, and reports presenting research findings and information about candidate performance from year to year. The development of a large-scale test such as the Cambridge Young Learners English Tests involved initial outlays that would not be recouped for several years.
In the test design phase, the groundwork was therefore set for the test.
The operationalization phase
Preparing test specifications and blueprints for tasks: The Cambridge Young Learners English Tests are structured around three levels of proficiency, Starters, Movers and Flyers. The three levels of the test can be taken by any age group, depending on their hours of English tuition. However, Starters is designed for children from the age of 7. Movers is typically taken by children aged between 8 and 11, and Flyers test takers are typically aged between 9 and 12 years.
In order to focus on success and to motivate children to continue to learn English, the test developers established that all candidates who complete the test would receive an award. The award would focus on what they can do, rather than what they cannot do, and would give credit for having taken part in the test. This aspect of the test addresses the vulnerability of young learners.
A blueprint for each test component (listening, reading/writing and speaking) and for each level was prepared. These are published in the handbook. There is therefore a clear, explicitly stated expectation of the types of tasks or items that will be in each test, available for teachers, parents and test takers.
Following the definition of the construct in early phases of the project, tasks in the test focus on meaning rather than form: they are based on the kinds of task-based communicative activity, often interactive in nature, which are already used in many primary classrooms around the world.
A guiding principle for the project has been a desire to close the distance between children’s experience of learning and testing (Taylor and Saville, 2002, p. 3). The age of test takers is taken into account in the task types. Tasks are brief and ‘active’ or ‘game-like’, e.g. colouring activities, and there are frequent changes of activity or task type. Test takers’ cultural differences are taken into account in the development of test tasks. Items that may offend, or might not be understood, are screened out. Skin tones in pictures need to be varied; words used may be regional (should ‘armchair’ or ‘sofa’ be used?). Tasks are presented in a clear and attractive format to meet the interests of young learners. Colourful illustrations help children to relax and feel less nervous. The tests are ‘topic-led’ like many popular course books (Taylor and Saville, 2002).
An example of a blueprint for tasks appears in Table 9.2. It is the blueprint for the Starters Listening test tasks:
Table 9.2 Summary of Starters Listening Test Components (University of Cambridge ESOL Examinations, 2003, p. 31)
PARTS/TASKS | MAIN SKILL FOCUS | INPUT | EXPECTED RESPONSE/ITEM TYPE | NO. OF ITEMS
1 | Listening for lexical items and prepositions | Picture and dialogue | Carry out instructions and position things correctly on a picture | 5
2 | Listening for numbers and spelling | Illustrated dialogue | Write down numbers and spelling | 5
3 | Listening for information (present tenses) | Pictures and dialogue | 3-option multiple-choice (pictures; tick the correct picture) | 5
4 | Listening for lexis and relative position | Picture and dialogue | Carry out instructions, locating and colouring correctly | 5
Each of the tasks or parts is described in further detail, for example:
Part 3: This task consists of five questions, each a three-option multiple-choice with pictures. The information is conveyed in a series of five self-contained dialogues. The speakers are always clearly differentiated by age or sex. There is a focus on the use of verbs in present tenses. (University of Cambridge ESOL Examinations, 2003)
Specifications are precise. For example, for the speaking test, an interview, the child is met by an usher (preferably someone who is known to them) who explains the test format in the child’s first language, then takes them into the exam room and introduces them to the examiner. The interview is three to five minutes long for Starters, five to seven minutes for Movers and seven to nine minutes for Flyers. The interviewer is trained to follow an ‘interlocutor frame’, a script on which the interview questions (and responses) are based. The interlocutor can correct the child if the child is misinterpreting the task, but cannot correct the child’s performance in the task. Examiners are told to be encouraging, and to actively praise children taking the tests, in order to put the children at ease and help them to do well in the speaking tests. Children need the reassurance of knowing they are doing the right thing to feel confident about tackling a task such as a problem-solving activity or describing a series of pictures. These instructions for the interviewer are part of the blueprint. The components of the interview are also written out for each level, as Table 9.3 illustrates.
Table 9.3 Components for Movers Speaking Component (University of Cambridge ESOL Examinations, 2003, p. 23)
PARTS | INPUT | EXPECTED RESPONSE/ITEM TYPE
1 | Greeting and name check: two similar pictures | Identify four differences between pictures
2 | Picture sequence | Describe each picture in turn
3 | Picture sets | Identify the odd one out and give reason
4 | Open-ended questions | Answer personal questions
Each level of the test has different procedures and tasks in the interview, in order that the developing proficiency level of children can be tapped.
Specifying the scoring method: Scoring methods were developed according to task type. Thus the speaking component is scored with the use of a rating scale that includes interactive listening ability, pronunciation and production of words and phrases. The task in Part 4 of the Starters reading and writing component is a gap-filling (prompted) task with one-word answers; therefore guidelines for scoring on a correct/incorrect basis are given. A test is made ‘easier’ if misspellings are accepted. Cambridge ESOL has decided that, since they are looking for meaning in language use, there can be some acceptable misspellings: ‘Spellings which are nearly correct, rather than completely correct, will be accepted, e.g. “trea” would be accepted if the expected answer was “tree”’ (University of Cambridge ESOL Examinations, 2003). Examples of marking keys are available. Mark distribution is specified, for example, 5 marks for each part. The following is part of a sample task.
Movers Reading and Writing Task (this is the third of three pictures in story sequence. Each picture has part of the story and questions beneath it).
Jane said, ‘Look at this! There’s a cupboard here!’ Jim carefully opened the cupboard door. He saw two green eyes looking at him. ‘Help!’ he shouted and they all ran outside and stood behind a tree. Jane was afraid and she climbed up the tree. Then the door slowly opened, and a black cat walked out.
7 What did Jim see? ...........................................................
8 Where did they stand outside? ...........................................................
9 Why did Jane climb up the tree? because .............................................
10 What came outside after the children? ............................................................
Figure 9.2 Movers Reading and Writing Task, Part 5 (University of Cambridge ESOL Examinations, 2002, p. 20).
The marking scheme for the whole task is presented as follows. The unbracketed words are needed to complete the answer correctly; the bracketed words are also acceptable in the answer.
Part 5: 10 marks
1 (an)(old) house
2 play (in it)
3 (the) door
4 Jane (and) Peter
5 Jane (did)/(wanted to)
6 (the) bedroom
7 (two)(green) eyes
8 Behind (a) tree
9 (she was) afraid
10 (a)(black) cat
Figure 9.3 Marking key for Movers Reading and Writing, Part 5 (University of Cambridge ESOL Examinations, 2002, p. 22).
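The way a marking key of this kind operates can be sketched in code. The following Python fragment is an illustration only and is not Cambridge ESOL’s marking procedure: the function names (parse_key, near, is_correct), the similarity threshold of 0.75 and the rule that extra words outside the key make an answer incorrect are all assumptions introduced here. The sketch treats unbracketed words as required and bracketed words as optional, and it tolerates nearly correct spellings such as ‘trea’ for ‘tree’.

```python
import re
from difflib import SequenceMatcher

WORD = r"[a-z']+"

def parse_key(entry):
    """Split a key entry such as '(two)(green) eyes' into (required, optional) word lists."""
    optional = []
    for group in re.findall(r"\(([^)]*)\)", entry):
        optional.extend(re.findall(WORD, group.lower()))
    # Whatever remains after removing the bracketed groups is required.
    remainder = re.sub(r"\([^)]*\)", " ", entry)
    required = re.findall(WORD, remainder.lower())
    return required, optional

def near(word, target, threshold=0.75):
    """Assumed tolerance for 'nearly correct' spellings (threshold is hypothetical)."""
    return SequenceMatcher(None, word, target).ratio() >= threshold

def is_correct(answer, entry):
    """Mark an answer on a correct/incorrect basis against one key entry."""
    required, optional = parse_key(entry)
    words = re.findall(WORD, answer.lower())
    # Every required word must be matched by some word in the answer.
    if not all(any(near(w, r) for w in words) for r in required):
        return False
    # Assumption: every word in the answer must match a required or optional word.
    allowed = required + optional
    return all(any(near(w, a) for a in allowed) for w in words)

print(is_correct("two green eyes", "(two)(green) eyes"))
print(is_correct("trea", "tree"))
print(is_correct("a dog", "(a)(black) cat"))
```

A human marker would apply more judgment than this sketch does; for instance, an answer containing a harmless extra word that is not listed in the key is rejected here, which may be stricter than actual practice.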
Specifications for the final scores for the test are that shields will be presented to children, with five shields as the top score on the test.
Thus, in this phase, the specifications for the test tasks and the test as a whole have been established, in preparation for the final test administration phase.
The test administration phase
In test writing, test writers on the Cambridge Young Learners English Tests follow the task specifications exactly, keeping, for example, within the vocabulary list and characteristics of the input (gender of speakers, number of speakers, etc.). Illustrators play a critical role in the development of tasks, since most tasks have pictures. These are adapted on computers as test writers require changes. As a test developer explained, the illustrators (with the task writers) have to ‘see into the child’s world, and see how they think’.
Try-out: In the developmental phase of the test, sample tasks were prepared according to each blueprint. Versions of the test were trialled in 1995/6 with over 3,000 children in Europe, South America and South East Asia. Feedback from teachers was collected, and children’s answers were statistically analysed. This information was used to construct the live test versions. The trialling and feedback confirmed, for example, that questions should be in colour (Taylor and Saville, 2002).
Operational test use phase: The administration of the Cambridge Young Learners English Tests is done through test centres around the world. Centres can apply to be a Cambridge Young Learners testing centre. Schools can apply, if they have the required facilities. Cambridge ESOL visits, checks and approves each centre. Guidelines exist for testing centres in relation to confidentiality, requirements for destruction of papers and other procedures. Approved Cambridge Centres are able to administer tests at times that suit them. They order versions of the test from Cambridge ESOL and administer them to fit in with local conditions (school terms, holiday periods, etc.). The administration department ‘manages’ the versions of the Reading, Writing and Listening papers to ensure security according to what the centre has administered in previous sessions. The Speaking Test materials are supplied in sets of 10 (from 2004) and are sent out once the notification of intention to hold a test is received. This allows increased choice in different markets and also improved security. The completed question papers and mark sheets are returned to Cambridge to be marked. Results are issued within two weeks of receipt of the scripts by Cambridge ESOL. Papers are double-marked, checked (sometimes three times) and then put into the computer twice. Candidates’ numbers are checked in the computer. Specific arrangements are made with some countries, for example China, for timing of testing, and marking of scripts.
To recognize one of the purposes of the test (to encourage future learning and teaching), the developers decided to emphasize predictability over authenticity in the tests. To ensure authenticity in tasks, a degree of unpredictability in language of the input and the expected responses (e.g., the length of the text, the grammar and vocabulary items) is needed. The construct of the tests aims to ensure adequate authenticity by mirroring tasks and approaches used in young learner classrooms around the world. Yet the developers believe that predictability is more important than authenticity for younger learners. Predictability is needed to bolster chances of success, and to ensure that the test is fair. Therefore they chose to strengthen predictability by publishing clear specifications for tasks and providing vocabulary lists that needed to be learned. The chances for children to be successful are bolstered by the accessibility of blueprints, sample tests, handbooks and research articles.
Considerations of the qualities of ‘usefulness’ or its equivalent were, and continue to be, an integral part of the test administration phase. Monitoring and evaluation, which are essential for test validation, are ongoing for these tests (Marshall and Gutteridge, 2002). An empirical process of calibration, using Rasch analysis, is used to check difficulty of items. However, since these are high facility tests, that is, they are designed so that most test takers can do well, there can be a lack of discrimination and the statistics can become unstable. Data are also collected through anchor items. These anchor items are then cross-compared with other tests, for example the KET (Key English Test in Cambridge Main Suite exams). It is possible to identify candidates, and to follow and observe their progress as they move up the three levels, and into the KET. Their rate of progress, that is, the time they need to progress, is monitored. Progress would be expected to be in lockstep; if it is not, this provides feedback on the difficulty of each level and of the items within them.
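The link between high facility and weak discrimination can be illustrated with the standard dichotomous Rasch model, in which the probability of a correct response depends only on the difference between a test taker’s ability and the item’s difficulty. The Python sketch below is a generic illustration, not the calibration procedure used for these tests; the function names and the example values are assumptions introduced here.

```python
import math

def rasch_probability(ability, difficulty):
    """Probability of a correct answer under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def facility(responses):
    """Facility value of an item: the proportion of correct (1) responses."""
    return sum(responses) / len(responses)

def item_information(ability, difficulty):
    """Statistical information an item gives about a test taker: p * (1 - p)."""
    p = rasch_probability(ability, difficulty)
    return p * (1.0 - p)

# An item matched to the test taker (ability equals difficulty) is most informative.
print(rasch_probability(0.0, 0.0), item_information(0.0, 0.0))

# An item that is very easy for the test taker tells us much less about them.
print(rasch_probability(3.0, 0.0), item_information(3.0, 0.0))

# Hypothetical responses to one item from four children.
print(facility([1, 1, 1, 0]))
```

When almost every child answers an item correctly, p is close to 1 and p * (1 - p) is close to zero, which is one way of seeing why estimates from high facility tests can be unstable.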
Ongoing research is integral to the test. For example, the following areas of research are relevant to the Cambridge Young Learner Speaking Test (Ball and Wilson, 2002):
• Developing a corpus (audio-recorded database) of speaking tests, made up of a representative sample of Young Learner Speaking Tests from around the world. The corpus can then be searched by a range of variables, including the age or first language of the candidate or the marks awarded.
• Undertaking qualitative analyses of transcriptions. For example, the following questions can be asked about the storytelling task:
    - How do candidates perform in the storytelling task compared to other parts of the test?
    - Are there qualitative and/or quantitative differences between the language produced in the storytelling task and other parts of the speaking test?
    - Do candidates hesitate or display uncertainty or nervousness in the storytelling task?
    - How do examiners use back-up questions in the storytelling task?
• Validating the rating scales. Are the assessment criteria for the Young Learner Speaking Tests appropriate? Results of a special re-rating project and questionnaires/protocol analysis with examiners will help Cambridge ESOL to evaluate the effectiveness of the assessment criteria and rating scales.
Summary: the test development process
Large-scale testing of young learners inevitably requires attention to the special characteristics and needs of young learners. The Cambridge Young Learners English Tests provide an example of a set of large-scale language tests for young learners. The test development phases and ongoing research help to ensure that children have a fair but also an enriching (rather than a discouraging) experience in the taking of the tests.
The design phase establishes the purpose of the test, young learner characteristics, the TLU domain in which they are learning, the typical tasks they learn through and the constructs that are to be assessed. This step is critical to the rest of the test development process. Once the design phase is established, the test is targeted for young learners and their needs and interests. The operationalization phase requires developers to prepare young learner-related specifications and blueprints that match the requirements set out in the design phase. The test administration phase follows through to actual use of the test with young learners, and an ongoing evaluation of the test for ‘usefulness’, specifically in relation to young learners. Young learners are at the forefront in each development phase.
The remainder of this chapter is concerned with issues in the large-scale testing of young second language learners, that is, of children who are being tested in the majority language of school, and often simultaneously in subject content. These children are in a different situation from foreign language learners who are usually assessed in the foreign language at the levels of achievement appropriate to the foreign language curriculum they have been studying.
Large-scale standardized tests for school-age second language learners
The test development processes described above underpin all language test development, whether the test is for foreign or second language learners, and whether it is for younger or older learners. Differences in test purposes, the TLU domain, the characteristics of learners and so on will influence the final nature of the test. Thus, the multiple points at which second language learners enter into language learning, the requirements of the academic language of schooling in their TLU domain and the rapid progress that second language learners are likely to make are just some of the many factors that make test development decisions different for second language learners. This section is concerned with the macro-decisions that are made around large-scale testing of second language learners, and will then examine how test development processes are applied to testing academic language proficiency. Most of the issues in this section apply to all learners at school, whether they are older or younger, though issues related to academic language proficiency become more and more salient as children progress through into upper primary school and beyond.
There are two main scenarios in which second language learners are tested on a large scale. The first scenario is by far the most common; this is when second language learners are tested on large-scale content-area tests normed on the general population of students and taken by all students. Thus, second language learners take tests in maths, science and other subjects designed to test all students’ achievement. In some countries, the content-area tests that young learners are expected to take are limited to literacy and numeracy tests, again normed on the general population. In these testing situations, second language learners are simply counted as part of the general population and assessed accordingly. The second, less common, scenario for large-scale testing of second language learners is when a specifically designed second language test (rather than a majority-normed test) is used to monitor the language proficiency of second language learners.
These two large-scale testing scenarios for second language learners feature in different ways in different countries around the world. The first scenario is played out in most countries. For example, in the United Kingdom, second language learners are assessed through National Curriculum tests; that is, their progress is assessed through the common content-area testing procedures for all children. In Australia, second language learners are assessed through common large-scale literacy tests, although in most States and Territories further teacher-based monitoring of their language progress using ESL standards is also conducted. In the United States, large-scale content-area tests, based on curriculum standards, are commonly administered to all learners. All students are expected to take these tests, though early second language learners have sometimes been excluded. A federal policy initiative, the No Child Left Behind Act (2001), has meant that the second scenario, the testing of learners on specifically designed second language tests, is also being pursued in the United States. The act requires the use of a uniform state-level data set for second language learners, collected through yearly large-scale English language proficiency assessments of second language learners. This initiative has resulted in the development and use of a number of English language tests specially designed for early second language learners. The data from these tests are available to be used for a number of purposes:
• collecting baseline data on academic language proficiency to determine annual growth and annual yearly progress
• monitoring student progress in academic language proficiency achievement on a summative basis
• reclassifying, redesignating, or transitioning students from support services
• determining accountability for student learning
• measuring maintenance of student progress after transition from support services
• providing feedback to all educational stakeholders
(Gottlieb, 2003, p. 3)
In the following sections we look at the pitfalls in tests in the first scenario, that is, where school-age learners are tested in content-area tests designed for majority, first language learners. There are, of course, many pitfalls in tests in the second scenario, those designed specifically for second language learners. These generally relate to the pitfalls that are encountered in every test development process; Abedi (2004, p. 12), for example, has commented on the weaknesses of many English language proficiency tests used for second language learners in the United States, suggesting concerns with their operationalization of language proficiency, with validity and reliability, with the adequacy of scoring directions, and with the limited population on which test norms are based. Ways to include academic language proficiency in the operationalization of language proficiency in tests for second language learners are addressed in the final section of this chapter.
Pitfalls in testing second language learners through large-scale content tests normed on first language learners
Pitfalls in the design phase
Much has been written about the pitfalls of assessing second language learners on tests normed on majority group, first language learners (Cummins, 1983; Garcia and Pearson, 1994; Valdes and Figueroa, 1994; Butler and Stevens, 2001; McKay, 2001; Bailey and Butler, 2003). When tests are normed on majority first language learners, right from the design phase in the test development process, learner characteristics and the expected performance (the construct to be assessed) are more likely to be invalid for second language learners. The TLU domain and tasks are generally the same for both first and second language learners (they are learning in the same mainstream classrooms) unless they are in intensive language centres, but the expected performance in large-scale content-area tests is normed on the language performance and academic achievement of the majority. In addition, the nature of, and expected levels of, content-area knowledge will be normed on mainstream first language students, thus creating content bias in tests. Young learners who speak a dialect of the majority language and/or who come from a non-mainstream culture will also experience the same difficulties with regard to the expectations and content bias in tests.
In addition, when second language learners take content-area tests, their developing language proficiency is likely to have an influence on inferences that are made on the basis of the test scores, especially when the tests are made up of performance assessment tasks (Bachman, 2002). That is, the tests may be invalid for second language learners because they may not be able to show the extent of their topical knowledge relevant to the assessment tasks. This is also a form of content bias. In a similar vein, test tasks may have content bias because they are culturally inappropriate for students from different cultural backgrounds. Culturally appropriate tests may need to be informed by research with different cultural groups; this research, including interviews and think-aloud protocols with students, would help to reveal students’ difficulties, from interpretation of questions because of different communication styles, to differences in contextual understandings that influence performance. Solano-Flores and Trumbull (2003) give an example of mistaken contextual understanding in which a student from a low-income family misinterprets a math ‘Lunch Money’ problem, understanding that the mother has only $1 instead of in fact having multiples of $1. They conclude: ‘understandings of non-mainstream language and non-mainstream culture must be incorporated as part of the reasoning that guides the entire assessment process’ (Solano-Flores and Trumbull, 2003, p. 12). Each pitfall that arises in the design phase inevitably resurfaces in the next two phases, emphasizing the need for an appropriate foundation to be established in the design phase.
Pitfalls in the operationalization phase
In the subsequent operationalization phase of test development, large-scale content-area tests normed on first language learners are shaped by the learner characteristics and constructs defined in phase one. The criteria for scoring will reflect the construct established in the design phase, that is, the language achieved by successful first language learners of this age group. In the marking of the test it is likely, then, that second language and literacy characteristics will be regarded as errors, rather than recognized as evidence of creativity, risk-taking and progress along second language learning pathways. Students’ cultural misunderstandings and linguistic approximations will also be marked as incorrect.
Pitfalls in the administration phase
In the administration phase, statistical analysis in the try-out stage is likely to eliminate items that do not represent the majority.
To create a final test, those items that have the lowest correlation with the total test score are eliminated on the grounds that they provide a poor estimate of the phenomenon being measured. In other words, those very items on which low-scoring students do comparatively well disappear! If we remember that low-income and ethnic minority students are overrepresented in the set of low-scoring students, then it is almost inevitable that minority students will perform relatively poorly on final versions of tests built through this process.
(Garcia and Pearson, 1994, p. 343)
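The selection procedure that Garcia and Pearson describe can be made concrete with a small worked example. The Python sketch below computes an uncorrected item-total correlation (a Pearson correlation between each item’s 0/1 scores and the total score) for a tiny, entirely hypothetical data set; the function names and the response matrix are assumptions introduced here, and real try-out analyses use far larger samples and usually more refined statistics.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # an item everyone answers the same way carries no signal
    return cov / math.sqrt(var_x * var_y)

def item_total_correlations(matrix):
    """Correlate each item (column) with the students' total scores."""
    totals = [sum(row) for row in matrix]
    n_items = len(matrix[0])
    return [pearson([row[i] for row in matrix], totals) for i in range(n_items)]

# Hypothetical try-out data: one row per student, one column per item (1 = correct).
# The last item is one on which the lower-scoring students do comparatively well.
scores = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
]

correlations = item_total_correlations(scores)
weakest = min(range(len(correlations)), key=correlations.__getitem__)
print([round(r, 2) for r in correlations])
print("item with the lowest item-total correlation:", weakest + 1)
```

In this invented data set the item that the lower-scoring students answer correctly is the one with the lowest correlation with the total score, so it is the item that a purely statistical selection rule would drop.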
When a test is administered, it may have prescribed time limitations, and these are problematic for second language learners taking standardized content-area tests. Early second language learners take longer to process both the questions and their answers. As a result they are deprived of the time they need to show what they are able to do. Unfamiliar vocabulary and paraphrasing and expressions written in academic language may ‘throw’ students who know the correct answer (Garcia and Pearson, 1994). Thus in each stage of the test development process, there are factors militating against the valid and fair testing of second language learners.
It is possible to control for content bias, that is, for differences in students’ knowledge of topics by, for example, making sure that a variety of topics are covered in the test, and by eliminating questions that can be answered without having to read the information provided in the task (Garcia and Pearson, 1994). Garcia (1991) found that when topical knowledge was controlled statistically in a standardized test, the comprehension differences between Latino and Anglo children disappeared. Garcia and Pearson (1994, p. 349) also report that it is possible for second language learners to demonstrate their understanding of a text more effectively when they are allowed to use their first language in the assessment task. Research of this type is informing large-scale test developers that there are weaknesses in large-scale content-area tests for second language learners that need close attention. Other solutions recommend that early second language learners are not included in standardized content-area tests at all until they have reached a certain level of proficiency in the second language. These types of solutions are gaining credibility in the United States; details of such proposals are discussed below.
Concerns about impact
The above concerns about large-scale tests normed on majority first language learners relate in particular to their validity as tests for second language learners. An aspect of validity is the nature of the impact of a test on teaching and learning, and on young learners’ lives. In large-scale testing, the scores of second language learners are inevitably lower, especially in the first several years of their schooling in the new language. The scores of indigenous children in Australia, for example, on national literacy and numeracy tests are consistently reported as significantly lower than the norm. Many of these children live in communities in the outback, though many also live in country towns and cities. Many of those who live in the outback speak their own language(s) at home and have little contact with English until they come to school. Some speak Aboriginal English, a dialect of standard Australian English. They are expected to achieve as well on the literacy and numeracy tests as all Australian learners, but this is not always possible, especially in the early years of school. The impact of low results for individuals, and for the group (since the results of indigenous learners are generally combined in analyses of results), is low self-esteem among indigenous children and their parents, and a sense of discouragement among their teachers. Many second language children from a range of cultural and linguistic backgrounds in countries around the world may not be as visible as indigenous learners in Australia, but the impact of standardized content-area tests normed on their majority-language-speaking peers may be negative in similar ways. Yet the reality is that in second language terms, many learners are achieving strongly in the target language at their own level. Large-scale content-area tests do many second language learners a disservice; they ignore the huge advances they are making, hinder them from showing what they know in the content area and then deliver results that can cause distress to the learners, their families and their teachers.
In addition, in high-stakes situations, tests can have a ‘disproportionate curricular influence’ (Garcia and Pearson, 1994) in that teachers may teach to the test to gain the best results from their students. The test becomes the teachers’ main reference point, with the result that the curriculum may be narrowed to the scope of the content covered in the test. Second language learners may be deprived of the many experiences they need in the language to help them to progress. Writers who deal with the negative impact of standardized content-area tests point out that many of these effects are also felt by students of lower socioeconomic status who also experience dissonance with the dominant culture.
Important research is taking place, especially in the United States, designed to counteract the negative influences of standardized content-area assessment on second language learners. The following section outlines some strategies that can be used to avoid some of the pitfalls of large-scale content-area tests.
Strategies to avoid some of the pitfalls of large-scale content testing normed on first language learners
The reality is that in current managerialist (or economic rationalist) educational environments, the pitfalls of large-scale content testing are outweighed, in administrators’ minds, by the need to gather overall, comparable data about the achievements of all students, teachers and schools. Given this situation, can large-scale testing be made more valid and have a more positive impact on second language learners? This section outlines two strategies that have been proposed or are being used, though their effectiveness is not yet fully established. They are that the education system:
• introduce accommodations into large-scale tests
• restructure testing pathways for second language learners
Introduce accommodations into large-scale tests
Accommodations are additional support mechanisms provided in a test for a designated group of test takers who need help to access the content or to demonstrate what they know. Butler and Stevens (2001) suggest that there are two categories of test accommodations: modifications to the test itself and modifications to the test procedure (see Table 9.4 below). In each category there are a number of ways that the test can be modified to help second language learners. Some accommodations are likely to be less valuable for young learners than others (e.g., the use of dictionaries and glossaries with younger learners) but most can be applied easily to young learner assessment. Many accommodations make the text and the questions more accessible and give test takers a better chance to show what they know.
Table 9.4 Two categories of accommodations for English language learners (Butler and Stevens, 2001, p. 413)
Modifications of the test
• assessment in the native language
• text change in vocabulary
• modification of linguistic complexity
• addition of visual supports
• use of glossaries in native language
• use of English glossary
• linguistic modifications of test directions
• additional example items/tasks
Modifications of the test procedures
• extra assessment time
• breaks during testing
• administration in several sessions
• oral directions in the native language
• small-group administration
• separate-room administration
• use of dictionaries
• reading aloud of questions in English
• answers directly in text booklet
• directions read aloud or explained
Results of research on whether accommodations make any significant difference in the performance of second language learners on content-area tests are mixed; that is, significant improvements are not always evident and, importantly, improvement depends on the student’s level of language proficiency (Butler and Stevens, 2001; Gottlieb, 2003). In some cases improved performance of English language learners in a large-scale test has been evident; for example, Abedi et al. (2000) found there were differences in the performance of fourth-grade students who were and who were not given an English and a bilingual dictionary during a test.
More research is needed into the idea of tailoring accommodations to the nature of students’ language proficiency and knowledge. Because learners benefit from accommodations differently, depending on their proficiency, it is important that standard accommodations are not simply added to tests to ‘solve the second language learner problem’. As Gottlieb points out, add-on accommodations might not be valid for many second language learners.
Accommodations generally apply to intact or off-the-shelf school district or state tests, some that are high stakes in nature, that have not been conceptualised, piloted, or normed on English language learners. In that case, accommodations become the means of retrofitting assessments that, by their very nature of development, are invalid for English language learners. (Gottlieb, 2003, p. 31)
We are not yet sure how valid or effective accommodations are in helping second language students in standardized content-area tests. Test accommodations are therefore one alternative to large-scale content-area tests, though further research is needed to explore how accommodations can be used most effectively. (See Koenig and Bachman (2004) for a detailed discussion of issues of accommodations for ELLs and research needed.)
Restructure the testing pathways for early second language learners
Gottlieb (2003) addresses the problems of large-scale assessment of second language learners in the United States by proposing a framework to restructure the large-scale assessment of second language learners. She argues that it is important that assessment should be sensitive to longitudinal, individual student growth and not rely solely on large-scale, high-stakes tests that look for commonalities across large learner groups. She also argues, as I have done above, that it serves no purpose to include second language learners in large-scale assessment crafted for first language English speakers; more often than not, this practice results in penalties for second language learners and their schools.
Gottlieb firstly redefines large-scale testing. Large-scale testing, she suggests, can be carried out in the classroom, at grade levels/departments, at the school level, at school district level and at the state level (p. 21). Then she proposes a framework in which large-scale testing of ELLs occurs at three stages:
a ‘alternate’ (or alternative) school-based assessment
b school district or state assessment with accommodations
c school district or state assessment.
‘Alternate’ assessment is Gottlieb’s term for standards-based measures specifically designed for early second language learners, carried out at the school level in ways that produce defensible data (p. 25). This kind of assessment is non-standardized and is carried out at the classroom or school level. It may take a number of forms, including the following:
• a specific test reflective of ESL, content-based instruction
• an achievement measure in the students’ native language parallel to another large-scale tool
• a set of content-based tasks interpreted with standard rubrics
• a standard, student portfolio of academic performance
Gottlieb suggests that teacher-based assessment can be used as ‘a form of large-scale testing’ for second language learners, that is, that it can take the place of large-scale testing, at least for second language learner groups, when they are at the beginning stages of language proficiency. At the same time she recognizes that large-scale testing is useful at certain points in a child’s educational career to provide confirmatory evidence for administrative purposes. She advocates that classroom-based assessment can produce data for large-scale assessment under the following conditions:
• when standard prompts (blueprints) appropriate for students’ age and development are made available to teachers
• when content-related language samples are collected (a) in the fall to establish an initial baseline, (b) at mid-year to monitor progress and (c) at the end of the year to measure growth
• when samples of performance are collected and held in the student’s records
School-based assessment should have certain characteristics if it is to be incorporated into a state’s repertoire of large-scale assessment. It should be conducted in educational contexts that are appropriate for ELL education (e.g., support services should be in place; sufficient teaching resources should be available; the political climate of the school should be supportive); there should be use of language in the classroom that facilitates students’ opportunities to demonstrate their content knowledge (this may be in their first language); and the technical quality of the assessment should be equal to that of other large-scale assessments and of comparable rigour. Technical quality is covered when, for example, standard prompts, or tasks, appropriate for students’ age and development and anchored in specific content standards across grade levels are made available to teachers; when content-related language samples are collected by classroom teachers on a regular basis; and when samples of performance are collected and held in students’ record offices. In addition, there must be strong inter-rater agreement in scoring (at least 85%) and standard guidelines for the collection, analysis and reporting of data. A reporting scheme that maps both the academic achievement of ELLs and their language proficiency should be followed. In addition, ongoing validation studies and evaluation efforts should be carried out to ensure that the scheme is functioning optimally for ELLs. Thus Gottlieb is suggesting that school-based assessment can be made rigorous enough to replace or work alongside large-scale, standardized content-area assessment, at least in the students’ early stages of second language learning.
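The 85% figure for inter-rater agreement can be made concrete with a short calculation. The sketch below is illustrative only: it assumes that agreement means simple exact agreement (the percentage of samples to which two raters assign the same score), which is one common reading but is not specified in the framework itself. The function name and the two sets of scores are hypothetical.

```python
def percent_agreement(rater_a, rater_b):
    """Percentage of samples on which two raters gave exactly the same score.

    Assumption: 'agreement' is taken to mean exact agreement; other
    statistics (e.g. chance-corrected ones) would give different values.
    """
    if len(rater_a) != len(rater_b):
        raise ValueError("both raters must score the same samples")
    if not rater_a:
        raise ValueError("at least one scored sample is required")
    matches = sum(1 for a, b in zip(rater_a, rater_b) if a == b)
    return 100.0 * matches / len(rater_a)


# Hypothetical rubric levels (1-3) given by two teachers to the same ten samples.
teacher_1 = [3, 2, 2, 1, 3, 3, 2, 1, 2, 3]
teacher_2 = [3, 2, 2, 1, 3, 2, 2, 1, 2, 3]

agreement = percent_agreement(teacher_1, teacher_2)  # 9 of 10 scores match
meets_threshold = agreement >= 85.0                  # the 'at least 85%' criterion
```

With these invented scores the two teachers agree on nine of ten samples, so the criterion would be met; a second disagreement would bring agreement down to 80% and fall short of it.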
As ELL students progress in their English language proficiency (and in some cases, content knowledge), they will progress through two thresholds, at which point changes can be made to the testing they undergo. Students reach Threshold 1 when their school-based assessment indicates they are ready. Once they have reached Threshold 1, they are able to be tested through the school-district/state content-based assessment in which accommodations are provided. Students reach Threshold 2 when they are ready to take content tests without accommodations. Ideally, decisions are made about whether students have reached Threshold 2 based on their previous performance on tests with accommodations, and also on other indicators of academic language proficiency.
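The staged pathway just described can be summarized as a simple decision rule. The following is a minimal sketch, not part of Gottlieb’s framework: the names `Stage` and `assessment_stage` are hypothetical, and the two true/false inputs stand in for professional judgments (school-based assessment results, performance on tests with accommodations, other indicators of academic language proficiency) that the sketch does not attempt to model.

```python
from enum import Enum


class Stage(Enum):
    """The three stages (a-c) of the framework, named here for illustration."""
    ALTERNATE_SCHOOL_BASED = "a"
    WITH_ACCOMMODATIONS = "b"
    WITHOUT_ACCOMMODATIONS = "c"


def assessment_stage(reached_threshold_1: bool, reached_threshold_2: bool) -> Stage:
    """Return the stage at which a student would be tested.

    The two flags represent judgments made by educators; how those
    judgments are reached is outside the scope of this sketch.
    """
    if reached_threshold_2 and not reached_threshold_1:
        # In the pathway as described, Threshold 2 follows Threshold 1.
        raise ValueError("Threshold 2 cannot be reached before Threshold 1")
    if reached_threshold_2:
        return Stage.WITHOUT_ACCOMMODATIONS
    if reached_threshold_1:
        return Stage.WITH_ACCOMMODATIONS
    return Stage.ALTERNATE_SCHOOL_BASED
```

A student who has reached neither threshold stays in school-based ‘alternate’ assessment; the code makes explicit only the ordering of the stages, not the evidence needed to move between them.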
Gottlieb’s staged approach provides an alternative to large-scale content-area assessment for early second language learners. It reduces the dangers that large-scale content-area assessment brings for learners who are unable to show what they know because of their developing language proficiency. It recognizes that school-based assessment is of higher pedagogical value for early second language learners and, by introducing the imperative of technical quality, attempts to win administrators over to support school-based assessment at this point in students’ learning.
Large-scale testing is a reality in the United States, and therefore Gottlieb provides a stepping-stone, through school-based assessment, and through tests with accommodations, to full unsupported testing.
The idea of a staged approach to full content assessment is relevant to all countries and situations where second language learners are expected to be included in a large-scale content-area testing regime; it is also applicable to young second language learners wherever they are included in a large-scale testing regime.
Ways to address the assessment of academic language proficiency in large-scale tests
Second language learners require both social and academic proficiency in order to succeed at school (Collier, 1992; Cummins, 1980, 1984); therefore tests for second language learning should include academic proficiency, though this is not always done (Bailey and Butler, 2003). The academic language proficiency that is assessed should be the language required in students’ real-life world – that is, it should be based on the mainstream content-based curriculum that they have to study. In the early elementary school years, the academic language that children use in school is tied closely to classroom activities (the language of instructions, the language of doing and talking about the physical things around them). Activities such as art and mini-project work, e.g., building models, checking what happens with water and sand, all require early academic language that will become more sophisticated as they go through primary school. In the middle and upper elementary years, children need to describe objects and processes, report on what they have discovered, summarize their findings from a library project, and so on. An effective test for second language learners should reflect the language children need in their real-life world at school, and through this, alert teachers and schools as well as parents and the children themselves, to the areas of language they need to master in order to participate fully in the mainstream classroom.
A group of researchers in the United States has been researching the nature of academic language proficiency for test development purposes (Butler and Bailey, 2002; Bailey and Butler, 2003; Bailey, Butler, LaFramenta and Ong, 2004; Butler et al., 2004). In some components of their work (Stevens et al., 2000; Bailey and Butler, 2003), Bailey and Butler have concentrated on the design phase of academic language test development. They have been determining the nature of the TLU domain and the task types that young learners at upper primary level are expected to perform in mainstream classrooms and in standardized content tests. Their first task was to establish the definition of academic proficiency, to capture the language that students actually encounter in school. They did this by collecting language data in classrooms, and also by examining the following across a range of subjects and grade levels:
• empirical studies of student performance and the language demands of content and English language assessments
• the language prerequisites assumed in national, state and ESL content standards
• teacher expectations for language comprehension and production
• classroom exposure to all, including teacher talk and textbooks and other print materials.
Bailey and Butler’s (2003) approach to defining academic language proficiency draws on research by Mislevy and his colleagues in what they call ‘evidence-based design’ (Mislevy, Steinberg and Almond, 2002). Their analysis has enabled them to move towards a detailed and explicit account of the academic language needs of students. Analysis of four state content standards showed that elementary students are required to analyse, compare, describe, observe and record; at middle school level, students are required to compare, explain, identify and recognize. Analysis of the TESOL K-12 ESL standards gave further information about the kind of language necessary to achieve each TESOL goal. From observations of classrooms, the researchers found that teachers used primarily four language functions – description, explanation, comparison and assessment – and two repair strategies, clarification and paraphrasing. They found that student talk data revealed five predominant functions of language – explanation, description, comparison, questioning and commenting (Bailey and Butler, 2003). The working definition of academic language that the research group has adopted describes academic language at the lexical (vocabulary), syntactic (forms of grammar) and discourse (rhetorical) levels, with a central focus on the functions of language (Bailey et al., 2004). The work has made it clear that many English language tests are not assessing whether students have the English language skills necessary for success at school.
From the clarification of the construct (the academic language proficiency that primary-age learners need in mainstream classrooms), Bailey and Butler proceed to the development of task specifications that will reflect the construct. Table 9.5 is an example of a test specification for academic language proficiency assessment based on Bailey and Butler’s research. The specification follows Butler et al.’s (1996) framework for test specifications.
Table 9.5 Example of test specification components applied to a draft prototype academic language proficiency task (Bailey and Butler, 2003, p. 27)
Domain: Oral Language: Comprehension of Description (input) and Production of Explanation (output).
General description: The task will test the test taker’s ability to listen and comprehend the language of description and in turn produce the language of explanation.
Prompt attributes: The test administrator will read aloud to the test taker a short passage with specified attributes that give sentence length and complexity, breadth and depth of vocabulary, etc., as determined by textbook and classroom discourse analysis. The passage and explanation text question will be crafted to elicit the language of elaborated explanation. The task will have an academic theme or focus, but all information to provide an accurate response to the prompt will be included such that no specific content-area knowledge outside the prompt will be required.
Response attributes: The test taker will respond orally and will produce the necessary language to achieve the goals of the task, which include (1) demonstrating understanding of the language of description via responses to a series of comprehension questions, (2) using cognitive processes to infer relevant information from the descriptive passage, and (3) producing a fully elaborated explanation in response to the explanation question (see scoring guidelines under specification supplement below).
Sample item/task read aloud by test administrator:
I am going to read you a short passage and then ask you some questions about it.
A teacher specifically told a group of students to carefully place their experiments in a safe location in the classroom. One student placed his glass bottles very close to the edge of his desk. When the teacher turned around she was angered by what she encountered.
Who told the students to place their experiments in a safe location? (comprehension question)
Where did one student place his experiment? (comprehension question)
Who was angered? (comprehension question)
Explain as much as you can why the teacher was angry. (explanation question)
Specification supplement (scoring guidelines):
(1) Test taker will need to accurately answer comprehension questions about the description heard (scored correct/incorrect regardless of language sophistication and fluency), (2) test taker will need to infer that the teacher in the prompt was angry because she saw that the student put his experiment in the wrong place and (3) test taker will need to use the language of explanation (vocabulary, syntax, and discourse) to demonstrate that understanding to the tester.
Rubric for scoring explanations:
Level 1: Response is characterised by an incomplete and/or incorrect answer.
Example response 1a: The teacher was angry.
Example response 1b: The teacher was angry because the student put the bottle on his desk.
Level 2: Response is characterised by a generally correct answer but the test taker has failed to elaborate how the inference (the bottle is in a dangerous position and could fall easily) was drawn
Example response 2a: He didn’t follow directions.*
Example response 2b: The teacher was angry because the student did not follow directions.*
Level 3: Response is characterised by use of appropriate language to demonstrate a fully elaborated explanation. The test taker is able to infer that the teacher in the prompt was angry because the student put his experiment in the wrong place. The test taker demonstrates the use of the language of explanation to demonstrate that understanding (e.g., use of conditional tense for hypothetical events).
Example response 3: The teacher was angry because the student did not follow directions. He put his bottle very close to the edge of the desk, which is a dangerous place because the bottle could fall and break.
* Note from authors:
1 This example is for illustrative/conceptual purposes only and should not be seen as an operational test item. It is not a prototype to be modelled.
2 Note that in casual conversation, the explanations in Response #2a and #2b would be considered adequate. This highlights the difference between social uses of language and academic uses of language that hold speakers accountable for their claims, requiring them to verbally construct an argument citing evidence or logical conclusions to back up such claims. Moreover, these responses may be acceptable in many classrooms. Teachers may not require students to elaborate on their explanations in a way that overtly demonstrates to the teacher the necessary inferencing processes or steps in logical thinking.
Because this is a draft prototype of how academic language task specifications might be presented, the task in Figure 9.3 is necessarily short, both in its prompt and in the examples of expected responses. In reality, the prompts and the length and nature of expected responses in an academic task are likely to be longer and more complex, particularly for students in upper primary school. The test specification in Table 9.5 illustrates some important points. Firstly, the test developers are not expecting content knowledge outside the prompt to be demonstrated; this is to ensure that content knowledge does not override knowledge of the academic language they are assessing. Secondly, they are expecting cognitive processes to be activated in this task when students infer relevant information from the descriptive passage. And thirdly, they make sure that the top score (Level 3) goes to a response that has the characteristics of academic language, that is, one that requires students to use the language of explanation in which the conditional tense might be used for hypothetical events.
A close analysis of the construct of academic language proficiency and a careful development of test specifications make test ‘usefulness’ a more likely outcome. Certainly, a test of academic language proficiency, if used to further the academic language skills of young second language learners, is likely to have a positive impact on their future success at school and on their life chances.
Alternatives to large-scale testing for young language learners
Many educators have written about the advantages of alternative assessment over standardized assessment (Herman, Aschbacher and Winters, 1992; Genishi and Brainard, 1995; Huerta-Macias, 1995; Brown and Hudson, 1998). As we discussed in Chapter 5, proponents of alternative assessment advocate performance-based assessment in the classroom, tapping into what children can actually do in a natural and familiar language use situation. They turn away from large-scale tests that are usually paper-and-pencil, often made up of multiple-choice and discrete-point items, with children under pressure to show what they know in a limited space of time and in unfamiliar surroundings.
For young second language learners, teacher-based alternative assessment techniques have many advantages; for example, children can be assessed in familiar surroundings with familiar teachers, and tailored accommodations can be administered (and noted by the teacher) to help children show what they know and what they can do. This type of assessment has a major advantage for all stakeholders – children, parents, teachers, schools and education systems: assessment is carried out by those who spend time with the children and are able to witness the range of abilities they have. Immediate feedback is provided to children and to the teacher, and assessment runs as an underlying and supportive thread through learning. If the assessment is to be used for high-stakes decisions, then steps like those proposed by Gottlieb and Brindley, and described in the previous section, need to be taken to ensure that the assessment procedures produce ‘defensible data’. If administrators are able to accept the data that are given them through these processes, then all are winners, because the data are likely to be more ‘useful’ than those collected through large-scale tests. The result is more trustworthy data for administrators, principals and parents, and a fairer and more positive assessment experience for children. Brindley (1998; 2001) has written in depth about maximizing validity and reliability in outcomes-based assessment conducted by teachers. To standardize assessment procedures more closely, banks of prototype or actual assessment tasks can be made available for teachers to use in classroom assessment, and engaging teachers in moderation activities with trained personnel can help to check consistency of marking. Portfolios have also been suggested for use in systematic ways as an alternative to standardized tests (Salinger, 1998). A clear specification of what is required in the portfolio and how it will be scored, accompanied by professional development and moderation, is essential if portfolios are to be used systematically to collect comparative data over large populations.
Finally, many educators believe that large-scale testing of young learners need not necessarily be high-stakes. I refer once again to the EVA Project (Evaluation of English in Schools), conducted in Norwegian Ministry of Education schools by the University of Bergen (Hasselgren, 2000). (See Chapter 3 for more details of this project.) The project is different from many large-scale assessment endeavours in that the assessment procedures are not designed to provide data to administrators and parents, but rather to improve formative assessment in the classroom. It is therefore low-stakes for all participants. As part of the project, teachers are given tasks to use. Scoring instruments are provided to children to guide them in carrying out self-assessment, and to teachers to help them to conduct their assessment. Teachers are given professional development on how to use the material and also how to interpret the results. Scores and profiles that are produced as a result of this assessment are regarded as indicative of ability which should be pursued further by the teacher (p. 266). Research is conducted into test results and, for example, pupils’ responses, to provide insights into the assessment process and its value. The purpose of this large-scale assessment is to improve teaching and learning on a large scale.
In the absence of any tradition that smacks of grading in primary schools, both teachers and pupils are able to approach assessment without prejudice and put it to positive use. It seems that, in some ways, we have got it right. There are, so far, no ‘victims’ of testing in the Norwegian primary school, and the principal challenge to those involving themselves in this area will be to ensure that the situation remains that way! (Hasselgren, 2000, p. 267)
Thus assessment materials and guidelines are produced centrally to give teachers a common set of assessment procedures from which they can learn, and around which they can engage in collegial professional development. The ultimate purpose is to support teachers in their classroom assessment. Whilst these procedures do not provide data for administrators, principals and parents, they would be able to provide low-stakes information about students’ progress to centrally based advisory teachers, who would be made aware of support needs. They are worth considering as an alternative to large-scale high-stakes tests for young learners who are vulnerable to failure and unlikely to understand the full repercussions and requirements of the test-taking process.
Summary
There are many challenges in large-scale testing of young learners. Many educators object to the use of large-scale testing, particularly with young learners, for a number of reasons, in particular because of their vulnerability to failure, their lack of maturity which may lead to misconceptions about the test requirements, and their need for immediate feedback and subsequent adjustments to teaching.
Test development is undertaken in three phases: the design phase, the operationalization phase and the implementation phase (Bachman and Palmer, 1996). These processes help to ensure that tests are valid, reliable, practical, interactive, and have a positive impact (that is, that they are ‘useful’). The Cambridge Young Learners English Tests are foreign language tests that provide a case study of the test development process. They indicate how procedures must be systematic in order to ensure that a large-scale test for young test takers is as valid, fair and motivating as possible.
Second language learners are often required to take large-scale content-area tests normed on first language speakers. The reference point for these tests is the expected achievement of their first-language-speaking peers. Bias in these tests is created immediately from the start of test development, that is, in the design phase when the expectations for language and content achievement are set in relation to another learner group. The operationalization and implementation phases of test development perpetuate this bias. The result can be that the tests are invalid and unfair for many second language learners, and hence, the negative impact can be great.
Two strategies can be used to avoid the pitfalls of large-scale content tests. Accommodations included in tests help some learners, but their effectiveness is not yet established by research. A framework strategy has been proposed by Gottlieb (2003) in which second language learners proceed through a staged process, in which they are first assessed in the classroom by their teachers before reaching the first threshold, then introduced to large-scale content-area assessment with the support of accommodations within the test. Finally they move into large-scale content tests without support. Gottlieb’s strategy illustrates how complex solutions are needed to provide suitable tests for second language learners in a mainstream testing context.
Large-scale tests for young second language learners require the same test development processes as those described above. As with all test development, there are special considerations for each group of learners: second language learners are learning in mainstream classrooms and testing is concerned with language and content knowledge. Tests of language for young learners in these situations need to consider children’s academic proficiency; even for young learners in mainstream classrooms, there are early academic proficiency requirements in schooling. The work of researchers in the United States (Bailey et al., 2004) illustrates how academic proficiency for young second language learners is being defined and operationalized through an ‘evidence-based’ approach, in which data is being collected directly from curriculum documents, teachers and classroom observations about the nature of academic proficiency in the primary years.
Alternative assessment is advocated by many educators as a replacement for large-scale testing. Teacher assessment that is strongly guided, supported and coordinated by centrally based advisory teachers, as in the Norwegian EVA project (Hasselgren, 2000), can provide a strong alternative to large-scale standardized testing for foreign language programmes, and can be particularly beneficial for young learners.