Reflective Multimodal Presentation
CHAPTER EIGHT
Evaluating young learners’ performance and progress
Introduction
This chapter is concerned with the phase of assessment in which teachers and assessors evaluate the quality of children’s performance in language use tasks. How do teachers and assessors of young learners know what to look for in children’s performance? How do they establish the qualities of a good performance? What scoring methods do they use to guide their judgment about quality? And how do they do this in ways that make sure that the decisions that we make as a result of the assessment are as ‘useful’ as is required?
In this chapter I discuss the characteristics of good scoring rubrics, with examples of common types of rubrics that can be used in young learner language assessment. Scoring rubrics are the instructions for marking or scoring that are prepared for and by teachers and assessors. There are ways that scoring rubrics can be constructed to maximize their potential for effective scoring. Finally, in this chapter, some examples of standards are given and discussed, since standards are used more and more commonly to evaluate young learners’ performance and progress over time.
A note on evaluating performance during classroom formative assessment
The kinds of assessment decisions classroom teachers make about children’s performance are often embedded in the busy-ness of teaching and
terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9780511733093.009 Downloaded from https://www.cambridge.org/core. Wayne State Univ Libraries, on 15 Dec 2020 at 17:39:40, subject to the Cambridge Core
are mainly aimed at improving learning. Some commentators suggest that there are qualitative differences between the decisions that teachers make to evaluate performance in this kind of assessment and the decisions made by teachers and assessors to evaluate performance in more formal assessment tasks. They suggest that to make formative assessment decisions in the classroom, teachers generally rely on their internalized set of understandings about what kind of performance they should expect to see. The internalization of criteria is vital for teachers if they are to give instant feedback as part of formative assessment, and to follow it up with scaffolding and further instruction. Teachers’ criteria are gleaned from a range of sources – from the curriculum, from standards or benchmark documents, from the textbook, or from external tests that will be used to measure their children’s (and, indirectly, their own) performance. The following teacher’s comment shows how she learned to internalize the criteria from standards rather than deal with them all on paper.
Well, I used to use these sheets and I used to spend hours ticking this off and ticking that off and trying to work this in. It helped me look for things. I used to hear teachers saying ‘Oh, it’s all up here, it’s in my head.’ I used to wonder how they just knew where the child’s at and what they can and can’t do. But I can actually do that now and working through all the checklists and all the information that I thought I had to collect was far too much. (Leigh)
(Breen, 1997, p. 119)
Another line of thinking is that, instead of internalizing criteria from checklists and standards, teachers tend to use their own constructs, or those shared by the community of practitioners, rather than those established through standardized assessment criteria (Leung, 2004, p. 24). Leung (2005), following the work of Wiliam (2001) and others, suggests that teacher assessment might best be described as ‘construct-referenced’ assessment. In construct-referenced assessment the construct is held in the mind of teachers when they make judgments about performance. There is a common understanding of what is required; the criteria are not necessarily all written out and made explicit. This suggestion is made because there is some question about how formal and how ‘measured’ formative, and particularly on-the-run, assessment can be.
We do know that formative assessment is a complex and individualized process. Teachers hold in their minds the current performance of each child and are looking for the next expected gains for that child. Teaching occurs, and records are completed based on that child’s individualized
learning progress. Records therefore tend to be positive notes of gains in learning (‘can now label all the parts of the body’), rather than negative notes based on generally expected gains for the whole class (‘is not yet able to write a short letter to a friend’).
Researchers in teacher assessment are suggesting that validity and reliability in formative classroom assessment processes need to be viewed in different ways from those in formal assessment. They suggest that validity in this type of assessment rests in the construct held in the teacher’s mind being shored up by sufficient teaching experience, and by evidence that learning has taken place. They propose that reliability is best gained through the collection of sufficient observation data over many tasks (McMillan, 2003; Smith, 2003). It is also suggested that teachers can best evaluate the assessment process in formative assessment by checking on the learning that has taken place as a result of their assessment and feedback (Torrance and Pryor, 1998). We need to know more about the nature of the kinds of evaluations teachers make of children’s performance during formative assessment (Leung, 2005). Meanwhile, the principles behind the following procedures are still thought to form an important basis for evaluating children’s performance in the classroom. For example, even with formative assessment it is valuable for the classroom teacher to make explicit, from time to time, the criteria being used and decisions being made, and to check their appropriateness with colleagues. This is especially so when stakes are high.
The scoring method
In order to evaluate children’s performance in the most appropriate way possible, a scoring method is needed. The scoring method consists of (1) the criteria by which students’ responses are evaluated and (2) the procedures followed to arrive at a score (Bachman and Palmer, 1996, p. 194).
1. Deciding on the criteria by which students’ responses are evaluated
The definitions of performance that assessors look for when they are deciding what constitutes successful performance are generally called criteria. When criteria are written in scoring rubrics they are often also called descriptors. Criteria may be written as headings (Fluency; Accuracy), as
statements (Can participate in group activities) or as questions (Is the learner able to write a short letter to friends with appropriate informal language?). Criteria may be broadly defined (‘Can write a letter’) or more specifically defined (‘Can pronounce final consonants clearly’), depending on what criteria assessors are looking for. The criteria that teachers and assessors select as part of the scoring method relate back to the definition of the construct they are assessing, and those criteria they select ‘operationalise the construct’ for the performance (Bachman and Palmer, 1996, p. 194). Well-defined sets of criteria should therefore be theoretically based and not just randomly selected. Thus, the criteria for an oral presentation in class need to come from a theoretical understanding of what constitutes a successful oral presentation in class.
The more specific the criterion, the clearer and more objective can be the decision about the child’s performance, but this may mean that the information that is gleaned may be very specific and therefore not particularly informative for an assessment of language use. Thus, if the criteria are correct/incorrect as follows:
“Does the child answer ‘Yes, that is a dog’?” Yes/No
or
“Does the child draw a line between the horse and the field?” Yes/No
the teacher or assessor has an easy, objective decision to make. The criteria are absolutely specific, and the answer is either yes or no. The child will probably have participated in a certain degree of language use to complete these assessment items (in the first, the child identifies the dog and answers a question; in the second the child has understood a question and followed a command), but there is a limited amount of language use possible in assessment items of this kind. Language use tasks that are likely to assess and promote language use involve interactiveness and authenticity (Bachman and Palmer, 1996), as described in Chapter 4, and are less likely to be tasks that can be marked as simply ‘correct’ or ‘incorrect’. Objective criteria, with yes/no decisions, are not very appropriate for language assessment. As Davidson and Lynch (2002) point out,
In practice, this approach . . . [using specific criteria] . . . works only with domains that can be narrowly defined, such as basic mathematical operations or discrete point grammatical features, and is impossible to use with broader, more complex areas of achievement, such as communicative language ability.
(Davidson and Lynch, 2002, p. 12)
To assess language use, we have to accept a lesser degree of specificity in the criteria, because language use involves a number of different elements that together render the language use successful or not. For example, it is much more difficult to answer the following criteria with correct/incorrect. The marker has to make a personal, or subjective, decision to answer each question, because there is a complex judgment to be made about whether all the elements came together effectively into successful language use.
Did the child describe what she did on the class outing successfully?
Did the child retell the main events in sufficient detail?
Criteria such as these are more generally used in assessment of language use; they are often combined into groups of criteria, as the examples of scoring rubrics in this chapter illustrate. Assessment criteria for most language use tasks can never be made fully explicit (Brindley, 2001). The wording of criteria for language use tasks is rarely precise, and a decision on performance inevitably requires some interpretation by teachers and assessors. To work towards promoting children’s ability to use the language, teachers and assessors therefore have to find ways to work with less-specific criteria, and to work with some imprecision in marking (as is discussed in the next section).
2. Deciding on how to arrive at a mark or score
To determine the second component of the scoring method, deciding on how to arrive at a mark or score, teachers and assessors need to decide how the child’s response will be marked or scored. Will there only be a right or a wrong answer to each question or item, and thus a set of marks that will be added up? Or will there be varying degrees of correctness? There are different ways that a performance can be scored. If the task is a picture-matching task, a multiple-choice or a true/false task, then dichotomous scoring is likely to be used, that is, each response will be marked as either correct or incorrect. The final mark is arrived at by adding the correct responses together. Cloze and gap-filling task items are usually marked correct/incorrect; there may sometimes be two or three acceptable responses (e.g., ‘home’ and ‘house’) listed in the marking schedule, but in the end the answer is considered correct or incorrect. There are also ways that correct/incorrect scoring can be used effectively with multiple criteria, that is, scores can be reported separately for different areas of language ability by using a method of partial
credit scoring (see Bachman and Palmer, 1996, pp. 199–202). When dichotomous and partial credit scoring items are carefully constructed, teachers and assessors can quickly score the child’s performance and gain valuable information about a child’s needs, strengths and weaknesses.
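As an illustration only, the dichotomous scoring of gap-filling items described above might be sketched as follows. The function names, the sample marking schedule and the decision to ignore capitalization and surrounding spaces are assumptions made for this sketch; they are not drawn from any particular test or from Bachman and Palmer (1996).

```python
# Hypothetical sketch of dichotomous (correct/incorrect) scoring.
# Each item in the marking schedule lists its acceptable responses;
# a response earns 1 mark if it matches any of them, otherwise 0.

def score_item(response, acceptable):
    """Return 1 if the response is an acceptable answer, else 0."""
    # Assumption: capitalization and surrounding spaces are not penalized.
    normalized = response.strip().lower()
    return 1 if normalized in acceptable else 0


def total_score(responses, marking_schedule):
    """Add up the marks for all items to give the final mark."""
    return sum(
        score_item(response, acceptable)
        for response, acceptable in zip(responses, marking_schedule)
    )


# Invented three-item schedule; the first item accepts two responses.
marking_schedule = [{"home", "house"}, {"dog"}, {"ran"}]
responses = ["House", "cat", "ran "]

print(total_score(responses, marking_schedule))  # prints 2
```

However many alternatives the marking schedule lists, each item still contributes either one mark or none, which is what makes the scoring dichotomous.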
In language use tasks, where the quality of the performance is to be evaluated, scoring procedures are usually constructed in ways that enable markers to select a level at which children are performing. Markers choose amongst groupings of criteria that are designed to represent as well as possible the performance level of the language user. There are weaknesses in this kind of scoring procedure, as I will point out below, but the grouping of criteria into levels is considered to be the most effective procedure found to deal with the complexity of scoring of language use.
Scoring rubrics and reporting scales
Scoring rubrics are ‘instruction sheets’ guiding the scoring of performance and may be designed and used by teachers for their own classroom assessment, or by assessors for use by many markers involved in, for example, the scoring of a large-scale test. In some parts of the world, scoring rubrics are also known by teachers of young learners as criteria sheets (for certain kinds of rubrics) or scoring guidelines. Scoring rubrics generally provide the framework for the scoring method; that is, they contain both the criteria that are to be used (reflecting the construct to be assessed) and guidelines on how a mark or score will be arrived at. Scoring rubrics for language use tasks may be written for performance on a single task, or they may be written for performance across a range of tasks. Different types of scoring rubrics are used by teachers and assessors depending on the purpose, the audience and the context of the assessment. The type of scoring rubric that is used is determined on the basis of the construct definition; that is, on what is to be assessed. Thus, if the construct to be assessed is the ability of the young learners to perform a letter-writing task to a friend, then the rubric will be constructed around the nature of the expected ability of the young learners in question to write an informal letter to a friend. Once scores are decided, they are reported to parents, learners and others. Reporting scales provide general descriptions of performance against which teachers and assessors can report on children’s achievement. These same reporting scales can also provide general descriptions that can inform and guide the development of scoring rubrics. Performance standards, which I discuss below, are commonly used reporting scales in education departments.
A framework for the evaluation of scoring rubrics and reporting scales
How do we know if scoring rubrics and reporting scales are effective and appropriate? The evaluation of scoring rubrics can be complex, as a range of considerations about purpose, validity, reliability, practicality and use come together in scoring rubrics as they do in the selection of assessment tasks. This section sets out some of the main considerations in the evaluation of scoring rubrics for young learners. Alderson (1991) has referred to three distinct purposes for rating scales: constructor-oriented scales (designed to guide the construction of tests), assessor-oriented scales (designed to guide the rating process) and user-oriented scales (designed to provide useful information to users who will be interpreting the test scores). The first purpose is not relevant to the discussion here; I use the last two categories of scales, assessor- (or marker-) oriented and user-oriented, together with a general category, to organize a set of considerations for the evaluation of scoring rubrics and reporting scales.
Marker-oriented considerations are those that relate to the needs of markers (both classroom teachers and external assessors) as they evaluate children’s performance through scoring rubrics. User-oriented considerations are those that relate to the needs of users such as the children themselves, parents, other teachers and administrators as they read about performance described through reporting scales. General considerations are relevant to both scoring rubrics and reporting scales. Table 8.1 summarizes these considerations, which are then discussed briefly below. Some assessment tools, such as rating scales, can be used as both scoring rubrics and reporting scales; they should then be evaluated according to the purpose(s) for which they are being used.
General considerations
Are the scoring rubrics and reporting scales appropriate for young learners?
Scoring rubrics and reporting scales used to evaluate and report on young learners’ performance need to reflect the language use characteristics of the learner group in question and their learning context. In the case of
young learners, scoring rubrics should therefore reflect the characteristics of the expected language performance of young learners in the programme in which they are learning. They should reflect the developmental and literacy features that we know about young second language learners’ language use,
Table 8.1 Considerations in the evaluation of scoring rubrics and reporting scales for young learners
General considerations (for scoring rubrics and reporting scales)
• Are they appropriate for young language learners?
• Do they reflect the developmental needs and first and second language literacy growth characteristics of young learners?
• Do they reflect the curriculum and the learning opportunities that children have?
Marker-oriented considerations (for scoring rubrics)
• Do they reflect the purpose for assessment?
• Do they reflect the construct that is to be assessed?
• Are they clear and logical and therefore as unambiguous as possible for markers?
• Are they written in clear and objective language? Do markers have sufficient guidance on what mark should be given for what kind of performance?
• Are the numbers of levels feasible? Are they accompanied by samples of work?
• Are they practical, in the sense that they do not make unreasonable demands on markers’ time?
• Will they promote a fair assessment of all young learners?
• Do they avoid descriptions or procedures that advantage or disadvantage some children because of their culture, sex or socioeconomic background?
• Will their use promote markers’ professional understandings about learning?
User-oriented considerations (for reporting scales)
• Do the reporting scales reflect their purpose?
• Are they meaningful to those who will use them (learners, parents, other teachers, administrators)?
• Can users handle the language of the descriptors, the different dimensions and levels?
• Will the reported scores generated from the scoring rubrics serve users’ needs?
• Will they provide the information that is needed (e.g., about what to teach; about what to report to others, about resource allocation)?
• Will they have a positive impact on users?
• Will users’ understandings about the nature of second language learning be enhanced?
• Will they have a positive impact on the learning, and more broadly, on the lives of young learners?
and they should reflect the curriculum and learning opportunities to which children are exposed.
Marker-oriented considerations
Do the scoring rubrics reflect the purpose for assessment?
Scoring rubrics should reflect the purpose for the assessment. They might be prepared for formative or summative assessment in the classroom, where the classroom teacher is marking children’s performance, or they might be used in external tests that are to be marked by assessors. The purpose for the assessment will influence the degree of precision and detail that is required in the scoring rubric. The appropriateness to purpose will be reflected in the kind of information that is generated. If the purpose is to provide diagnostic information for teaching purposes, then detail about children’s performance will be required. If the purpose is to place the child on a level for purposes of comparison with other children in the cohort (say, of Grade 6 children in the state), then less information needs to be generated. In classroom informal assessments, criteria sheets and checklists can be prepared quickly enough to guide scoring; for summative assessment and in large-scale tests, more preparation and more detail, perhaps in analytic rating scales, is likely to be needed. For diagnostic assessment, analytic scales provide more information and therefore might be more suitable than holistic rating scales, as I will discuss below with regard to rating scales.
Do the scoring rubrics reflect the construct that is to be assessed?
Scoring rubrics need to reflect the construct that is to be assessed. Thus, if they are to be used to evaluate children’s second language performance in an interview, then the characteristics of young learners’ interview performance need to be established and identified through descriptors. The attributes that underlie successful performance should be identified in the scoring rubric, wherever possible on the basis of a theoretical understanding of the nature of that performance. Whilst, ideally, all the important characteristics of expected performance should be included, it is often not practical to include all aspects of performance, and it is better to indicate the salient characteristics that markers
should be looking for. If all features of performance are included, the scoring rubrics become too complex to use (Greatorex, 2003, p. 128). It is therefore best if rubrics exemplify the performance rather than describe every feature of performance required (Black, 1993, in Greatorex, 2003, p. 128). Salient characteristics may be criteria that represent the core features of that performance, or new understanding and achievements that teachers are looking for from the learner group in question.
Are the scoring rubrics understandable and logical, and therefore as unambiguous as possible for markers?
Without scoring rubrics of some kind, teachers are likely to be reliant on tacit knowledge, or on ‘expert notions’ of quality to evaluate and assign a mark to the piece of work. The use of expert knowledge is an important contributor to classroom assessment activities and can have some reliability of its own if there is ‘sufficiency of information’. However, as I pointed out above, a lack of guidance in the evaluation of performance may lead to threats to reliability. Scoring rubrics play a valuable role in making explicit the criteria by which the performance should be evaluated. It is important, therefore, that criteria are clearly defined and that levels are well articulated (Bachman and Palmer, 1996; Weigle, 2002). However, because scoring rubrics rely on verbal description, there is often some ‘fuzziness’ or vagueness in descriptors (Sadler, 1987, p. 202). According to Sadler (1987, p. 202) there are two types of descriptors: sharp descriptors, describing features that are either present or absent (return address, date, greeting), and fuzzy descriptors, which are matters of degree (clear solutions, superior writing ability, neat appearance). Scoring rubrics are likely to include both types. Fuzziness is seen in scoring rubrics when writers attempt to establish progress across levels using relative terms (none, little, some, much) and when they combine these relative terms with words like ‘coherent’ that require interpretation (incoherent, somewhat lacking in coherence, reasonably coherent, highly coherent). Despite these difficulties, Sadler suggests that fuzzy descriptions should not be seen as inferior to sharp ones, nor avoided wherever possible because of their lack of precision and their inability to guide assessment in a definitive way.
Verbal description cannot . . . be sharper or more precise than language will allow. In particular, fuzzy standards cannot be transformed
into sharp standards simply by using more detailed or elaborate language, for much the same reason that there are practical limits to the degree of improvement that can result from using a magnifying glass on a blurred photograph. The element, whether it be a group of words or a cluster of silver particles, needs a context for it to be properly interpreted. (Sadler, 1987, p. 206)
For this reason, Sadler suggests that samples of work should be used with verbal descriptions to exemplify the meaning of the descriptions, and that the number of exemplars can probably be made fairly small provided they are accompanied by explicit annotations of the properties of individual samples of work (Sadler, 1987, p. 207). Brindley, too (1998, p. 70), suggests that a library of exemplars of student performance can be built up to accompany scoring rubrics, in this case outcomes-based reporting frameworks, to help teachers to interpret and understand (Gipps, 1994) the assessment criteria. I discuss below ways in which reliability can be maximized in the use of scoring rubrics.
An important issue in scoring rubrics that have levels of performance is how many levels of performance there should be. The number of levels that are chosen depends on considerations of ‘usefulness’ (Bachman and Palmer, 1996, p. 212). The number of levels in a scale will depend on whether markers can reasonably be expected to distinguish performance at each level. It also depends on impact. If the purpose of the assessment is to place children into three different learning groups or classes, then three levels may be sufficient. If the purpose is to map children’s progress in enough detail to help teachers understand the nature of their performance and thus their need for teaching support, then more levels are likely to be needed.
Will the scoring rubrics promote a fair assessment for all learners?
Both markers and users hope for a fair assessment for all learners. As I discussed in the framework for the selection of tasks in Chapter 4, scoring rubrics should avoid descriptions or procedures that advantage or disadvantage some children because of characteristics such as their culture, sex or socioeconomic background. As an example, if criteria in the scoring rubrics are asking for performance that may favour boys over girls (e.g., a criterion that children write at least a page on a football game showing evidence of understanding about the game, and using appropriate football-related vocabulary), then there is likely to be some bias, both in the task
and in the scoring rubrics. By being appropriate for young learners and their context, appropriate to their purpose, and reflecting the construct to be assessed, they are also more likely to be fair for all learners.
Are the scoring rubrics practical?
Scoring rubrics need to be practical for the time and context in which they will be used. In the classroom context, teachers need rubrics that can be developed and used within the limited time available in classroom teaching. In this context, rubrics that are generalizable to several tasks save the time and energy required for their development. Brindley reports, however, that teachers will not necessarily reject assessment systems that increase their workload, if they perceive value in the information they gain for learners, teachers and parents (Brindley, 1998, p. 65). In formal testing situations, when scoring rubrics are developed by teams, and groups of markers are brought together to be trained and then to mark collaboratively and under supervision, more complex scoring rubrics may be considered practical.
Will their use promote professional understandings about learning?
All materials to which teachers refer are bound to have some influence on their thinking. Scoring rubrics, especially in classroom teaching contexts, are able to inform professional understandings. Therefore, scoring rubrics that are prepared with proper attention to learner group, purpose and construct, as well as to clarity and fairness, are more acceptable in terms of influence on professional thinking, than those that are not.
User-oriented considerations
Do the reporting scales reflect their purpose?
Reporting scales are designed to establish the range of possible performance, and thus to establish and facilitate reporting of the level of children’s performance. Large-scale reporting scales in the form of performance standards carry complexity in their purpose, since they can have many purposes; they set out to establish the curriculum (what should be learned), to
provide a hierarchy of achievement for reporting purposes, and are often also used for accountability. There may be tension between these purposes, and this can cause confusion for users and weaknesses in the standards, as I will discuss below.
Are the reporting scales meaningful to those who will use them (learners, parents, other teachers, administrators)?
Reporting scales should be meaningful not only to markers, but also to others who use them. Thus, children (depending on their age and the complexity of the criteria in the reporting scales) can be guided to understand what is required in a task to judge their own performance, and what aspects of their performance need to be improved. Parents can be given a better opportunity to understand the expectations of tasks, or the requirements across a term’s work, when scoring rubrics are available to them. Administrators prefer straightforward measurable scoring rubrics that provide them with the answers they are looking for, without making the evaluation too complex.
Will the reported scores generated serve users’ needs?
Some reporting scales are not designed for other users, but simply for the classroom teacher. Some are designed with the learner in mind; they can help children to understand the strengths and weaknesses of their own performance. Other reporting scales are shared with parents, or other teachers. Yet others, such as performance standards, are written for teachers, parents and administrators to understand. Thus, whether reporting scales will serve users’ needs depends on their purpose, and, as I discuss later in this chapter, in some cases conflicting purposes result in tensions in the way reporting scales are presented and used. The report also needs to meet users’ needs. In some cases, for example when parents are receiving a report, profile reporting might be best; in others, for example when administrators are making resourcing decisions, a single score or a set of scores is more likely to be required. Administrators need reporting scales that will generate the kind of information they need to make administrative decisions, related to resourcing, reporting on trends, and accountability. For this reason, administrators like certain kinds of performance standards, as I will discuss in the last section of this chapter. Developers of reporting scales need to decide on the users’ tolerance for discrepancies (Weigle, 2002, p. 127) and to be as clear as possible about purpose and user needs.
Will the reporting scales have a positive impact on users?
In many cases, reporting scales will only be seen by markers, that is, by classroom teachers and assessors of external tests. In some cases learners, parents and other teachers (for example, mainstream teachers) are shown reporting scales, and performance is discussed with them. Administrators are presented with patterns of performance of cohorts of learners as they monitor data from tests and teacher reports. The messages that learners, parents, teachers and administrators receive from the assessment process should aim to have a positive impact, promoting, in a broad sense (that is over cohorts and over time), understanding and notions of strengths and progress rather than weakness and failure. If assessments, of which reporting scales are a part, result in disempowerment of young learners, perhaps through a loss of self-concept or a sense of inferiority in the community, then the assessments are not ‘useful’ and have failed (Cummins, 2000). This can happen, for example, when children are observed erroneously to be failing over several years when first language reporting scales are used to monitor their second language and literacy progress (Davison, 1999; McKay, 2001; Davison and McKay, 2002).
Types of scoring rubrics and reporting scales used with young learners
This section presents some different kinds of scoring rubrics used in the evaluation of young learners’ language performance. Some of these are intended to be used mainly by classroom teachers, while some could be used both in the classroom and in large-scale assessments. Observation checklists are commonly used by classroom teachers of young learners as they observe, note and check off children’s performance during classroom teaching and learning activities. Criteria sheets are often teacher-constructed and are regularly used by classroom teachers for classroom assessment, though they may also be constructed by assessors for tests. Rating scales setting out worst to best performance to guide assessment decisions are more likely to be used in planned assessment tasks in the classroom, and in external tests where scales are prepared and used to reflect the range of performance (from weak to strong) that is expected of the particular group of children taking the assessment procedure. Rating scales may be used as reporting scales to report on achievement. Standards are also a type of reporting scale but are generally designed to report on performance across abilities and over time. Standards set out expected curriculum-related performance and progress over a number of years. Some examples of standards are given and discussed in a section at the end of this chapter.
Observation checklists
Classroom teachers often devise their own observation checklists to check that their learners are achieving the objectives they have set for learning, over a unit of work, or a length of time. An observation checklist is usually made up of points of observable behaviour that can be checked off, such as in the checklist in Figure 8.1.
Name:
Term:
Theme: My favourite animal

Columns: Yes/No | Comments (When? Where? How well?) | Teaching points to follow up

Can name their favourite animal
Can label the parts of the animal
Can describe the colours and shapes of the animal
Can ask someone else about their favourite animal
Can tell a story about their animal, with the help of a paper model and pictures

Figure 8.1 An example of an observation checklist.

When observation checklists are teacher-constructed and based on previous observations of learners, they are more likely to be appropriate to the learner group and learning programme. The decision regarding the observation may be a competency decision, that is, either children are competent or they are not in each criterion to be checked, or it could also be a profile rating, in columns under relative terms such as ‘emerging’, ‘developing’ and ‘consolidated’. It is very unlikely that teachers require a final mark to be tallied; the checklist is a profile that gives the teacher a sense of what each child can and cannot do, and this is sufficient for the information that is needed for teaching purposes, and for verbal descriptions of progress in reports to parents and others.
Teachers may also select descriptors out of developmental continua, place them in their checklist, and observe children’s progress against these indicators over time. Developmental continua are usually written by educational professionals attached to Education Departments or sometimes by commercial organizations. They describe the expected progress of learners, which is usually a combination of developmental and curriculum-related growth. The ‘First Steps’ materials, an extract from which is given in Table 8.2, provide developmental continua in oral language, writing, reading and spelling for first language learners.
Table 8.2 Extract from developmental continuum for first language reading development (Education Department of Western Australia, 1997c)
Making meaning at text level (early reading level)
The reader:
• is beginning to read familiar texts confidently and can retell major content from visual and printed texts, e.g. language experience recounts, shared books, simple informational texts and children’s television programmes
• can identify and talk about a range of different text forms such as letters, lists, recipes, stories, newspaper and magazine articles, television drama and documentaries
• demonstrates understanding that all texts, both narrative and informational, are written by authors who are expressing their own ideas
• identifies the main topic of a story or informational text and supplies some supporting information
• talks about characters in books using picture clues, personal experience and the text to make inferences
• provides detail about characters, setting and events when retelling a story
• has strong personal reaction to advertisements, ideas and information from informational texts, making links to own knowledge
• makes comparisons with other texts read or viewed
• can talk about how to predict text content, e.g. ‘I knew that book hadn’t got facts in it. The dinosaurs had clothes on.’
Teachers may also refer to performance standards to select relevant observation criteria. They add selected descriptors from the developmental continua and/or the performance standards to their observation checklists, usually checking off criteria as they see evidence of the stated behaviour in classroom activities or assessment tasks. Foreign and second language teachers need to be cautious about selecting descriptors from developmental continua and performance standards designed for first language learners. Depending on their background experience and age on entry into learning the target language, second language learners’ literacy development can differ quite markedly from that of first language learners. For example, young language learners will make grammatical ‘errors’ in their structures that do not appear in first language learners’ writing, yet these ‘errors’ are indications of the creativity essential for healthy language learning progress in foreign and second language learning (see McKay, 1998). Developmental continua designed for first language learners also may not consider different cultural backgrounds in their criteria. In these cases, the criteria that are included in teachers’ observation checklists may not be appropriate for young learners and their learning programmes, and may not reflect the construct of the assessment (that is, second language learning). The clarity, practicality and fairness of observation checklists depend on factors such as the kinds of descriptors teachers select and how they make their decisions about competency (e.g., how many times should teachers observe the behaviour in question before it is considered a competency?). When teachers list the characteristics of performance to be observed in collaboration with others, and with reference to appropriate documents, this is likely to assist their professional understandings.
When observation checklists are used primarily for formative assessment purposes, no users’ needs have to be met other than the children’s, who may, with help, have a chance to understand more about what is required and how they are progressing. When teachers’ observation sheets become the tool for evaluation and reporting of children’s progress to administrators, the stakes are raised. Administrators require a simpler report such as an indication of the child’s level(s) on performance standards. Interpreting data from observation checklists into levels on performance standards requires a high degree of subjective teacher judgment, and administrators find this difficult to accept as reliable. Conversely, this kind of reporting requirement can have a negative impact on the use of checklists for their original purpose, formative assessment.
Criteria sheets
A criteria sheet is a one-sheet table of descriptors, accompanied by a marking scheme, that guides markers to evaluate the quality of performance in a particular task. Criteria sheets are commonly used by classroom teachers of young learners for both formative and summative assessment purposes. In both the following examples, the overall construct to be assessed is the completion of the task (Can the child complete the task successfully?), and a set of criteria, organized into dimensions (categories of criteria), defines the constructs or contributing knowledge and skills to be assessed. These criteria are usually prepared by the teacher or by assessors with close consideration of the knowledge and abilities that are considered to make up successful completion of the task. In the first example in Figure 8.2, the dimensions considered relevant to an oral presentation to a class are ‘text content and organization’, ‘vocabulary and sentence structure’ and ‘responsiveness’, and specific criteria are listed within these dimensions. In the second example in Figure 8.3, dimensions and criteria for process report writing have been conceptualized with reference to systemic functional linguistics. Because criteria sheets are constructed for specific tasks, and usually with a particular learner group and learning programme in mind, they are more likely to meet the evaluation criteria for appropriateness for the learner group, the learning programme and the construct. Because they contain detail within the dimensions and criteria, they are appropriate for formative assessment and also, in more formal assessment, can give clear guidelines on the construct, or the characteristics of the performance required.
In both the examples, the scoring procedure is on the right hand side: teachers rate children’s performance as low, medium or high, as in Figure 8.2, or, as in Figure 8.3, ‘very competent’ to ‘not yet’. The example in Figure 8.3 has a scale at the bottom for teachers to give a final score. This type of scoring device is generally considered suitable for formative and low-stakes classroom-based assessment, since teachers use their own ‘expert notions’ or internal criteria (developed over time and based on shared understandings and experience) to establish what is low, medium or high performance. For some teachers, though, their internal criteria may not be well established, and their decisions about low, medium or high performance may not be appropriate. In these cases collaboration with other teachers in discussions of marks for different samples of work is beneficial.
Student’s name ____________ Level/stage_____________Date__________

Characteristics of the student to note:

Description of task (including characteristics of setting, input, etc.)
News telling: Oral presentation to class

Additional support given to this student

Columns: Assessment criteria | Comment | Rating (low to high)

Text content and organization
• includes key information (where, when, who, what)
• provides appropriate elaboration and detail
• maintains fluency
• concludes appropriately

Vocabulary and sentence structure
• connects ideas using appropriate conjunctions (and, but, then, unless, so)
• uses adjectives
• uses varied and specific vocabulary
• is generally accurate in structure
• articulates words clearly

Responsiveness
• is aware of interest needs of other children
• makes appropriate eye contact
• responds appropriately to questions

Comments

Final mark:

Figure 8.2 A criteria sheet for a news telling oral presentation to class.
Description of the activity: Students share or report on a process they used to complete a task in an individual or group activity

Name of student: Year level/class: Date:
Name of school: Teacher: Title of narrative:

Columns: Criteria [Tick appropriate box] | Very competent | Competent | Limited competence | Not yet | Comment

Ability to carry out the task: Did the student
• share or report willingly
• share or report with minimal teacher support [questioning or prompting]

Structure and organization: Did the student
• set the context, e.g. ‘For the plant project we . . .’ ‘Our experiment was called . . .’
• give sufficient detail
• sequence information, e.g. ‘First we did x . . . then we . . .’
• keep on the topic

Language features: Did the student
• use a variety of time phrases, e.g. ‘then, next’
• use specific vocabulary, e.g. stiff, beat, beaker, apparatus
• use verb tenses accurately and consistently, e.g. ‘when it was added . . .’
• use pronoun references accurately, e.g. it, this, these, those

Communication skills: Did the student
• speak fluently without too many hesitations, e.g. ‘um, er’
• speak clearly, e.g. pronounce words accurately, sound plurals and verb endings
• self-correct, e.g. ‘then he poured . . stirred’
• refer to finished product to enhance meaning [optional]

General comments:

Global rating: [circle] lowest 1______2_____3_____4_____5 highest

Figure 8.3 A criteria sheet for a report on a written process (based on the genre approach) (Education Department of South Australia, 1990).
Note: This example is designed for second language learners who are in upper elementary and who have written a science experiment.

If the purpose for assessment is formal and high-stakes, then this type of scoring rubric is not sufficient; criteria sheets should be supplemented with rating scales that establish more clearly what ‘low’, ‘medium’ and ‘high’ performance looks like. Samples of work exemplifying the levels can also support the rating process. This then helps the criteria sheet to be as clear and unambiguous as possible for markers. Criteria sheets, as many teachers know them, are related to a more formal approach to task-based assessment called primary trait scoring.

The philosophy behind primary trait scoring is that it is important to understand how well students can write within a narrowly defined range of discourse (e.g., persuasion or explanation). In primary trait scoring, the rating scale is defined with respect to the specific writing assignment and essays are judged according to the degree of success with which the writer has carried out the assignment.
(Weigle, 2002, p. 110)

In primary trait scoring a number of components are required:

(a) the writing task
(b) a statement of the primary rhetorical trait (e.g., persuasive essay, congratulatory letter) elicited by the task
(c) a hypothesis about the expected performance on the task
(d) a statement of the relationship between the task and the primary trait
(e) a rating scale which articulates levels of performance
(f) sample scripts at each level
(g) explanations of why each script was scored as it was.
(Weigle, 2002, p. 110)
These components are required in high-stakes assessment situations. However, it should be noted that assembling them is a time- and labour-intensive exercise (Lloyd-Jones, 1997, cited in Weigle, 2002, p. 110).
Criteria sheets, like observation checklists, have many strengths in teacher assessment situations, and for the same reasons. They can be developed to reflect the learner group, the learning context and the specific constructs in question. They differ from observation sheets in that they may be used in summative testing, and in external tests. Their fairness then rests in the employment of strategies to maximize reliability in marking, as listed above by Weigle, and as discussed below. The main users of criteria sheets are learners who can be guided to understand the requirements of tasks and to self-assess where their cognitive maturity allows for this.
Holistic rating scales
A holistic rating scale provides descriptions of ability at a number of different levels. These levels are provided on a single scale, which is divided into bands or levels labelled in various ways, for example from ‘needs improvement’ to ‘good’ to ‘outstanding’, or from ‘Level 1’ to ‘Level 5’. Holistic scales can be constructed from curriculum or theory-based definitions of language ability. Holistic scales are generally used when groups of teachers or assessors can work together to produce them and then to share them. They are often borrowed from publications or from each other by teachers and adapted to fit the tasks they are assessing. Figure 8.4 gives an example of a holistic rating scale designed to guide teachers and assessors to score a literature response.
The marker selects which of these levels best describes the child’s performance, in order to arrive at a decision about the quality of the performance. It may be that the child’s performance doesn’t meet every criterion in the level that is chosen; the usual practice is for the marker to select the level that reflects the performance most closely.
The second example of a holistic scale, in Table 8.3, is designed to assess writing across a number of writing samples. This scale has also been constructed around unarticulated dimensions that run through each level: organization, grammar, vocabulary and mechanics.
The advantages and disadvantages of holistic and analytic scoring are discussed by many writers, including Weigle (2002, pp. 112–14) and Bachman and Palmer (1996, pp. 219–22). Holistic scoring is faster (and therefore less expensive) than analytic scoring (discussed below) because markers can make an overall assessment quickly; their attention is focused on certain aspects of the performance, and this can guide them to know what is salient or important for a successful performance. It is argued that holistic scoring is more authentic, because it reflects more closely the personal reaction of a reader to a text, or of an observer to a performance, than analytic scoring methods do (Weigle, 2002, p. 114). Some disadvantages of holistic scoring are (1) that holistic scoring provides a single score, and therefore useful diagnostic information may not be collected about the performance, and (2) that holistic scores are not always easy to interpret, as raters may not use the same criteria to arrive at the same scores; for example, some markers may set more store by grammatical accuracy than others. Markers may also develop their own internal rating scale and this may drift away from the intended scale.
Outstanding
• Describes most story elements (characters, setting, beginning, middle and end of story) through oral or written language or drawings
• Responds personally to the story
• Provides an accurate and detailed description of the story
• Develops criteria for evaluating the story

Good
• Describes most story elements through oral or written language or drawings
• Responds personally to the story
• Provides an accurate description of the story with some details
• Analyzes something about the story (plot, setting, character, illustrations)

Satisfactory
• Describes some story elements through oral or written language or drawings
• Makes a limited personal response to the story
• Provides an accurate description of the story
• Explains why he or she likes or does not like the story

Needs improvement
• Describes few story elements through oral or written language or drawings
• Makes no response or a limited personal response to the story
• Provides a less than accurate description of the story
• States that he or she likes or does not like the story
Figure 8.4 Example of a holistic rating scale for a specific task (a primary trait rating scale): A Literature Response (O’Malley and Valdez Pierce, 1996, p. 113).
Table 8.3 Sample holistic rating scales for writing samples (Adapted from a rubric drafted by the ESL Teachers Portfolio Assessment Group, Fairfax County Public Schools, Virginia, O’Malley and Valdez Pierce, 1996, p. 22)
Rating Criteria
6 Proficient
• Writes single or multiple paragraphs with clear introduction, fully developed ideas and a conclusion
• Uses appropriate verb tense and a variety of grammatical and syntactical structures; uses complex sentences effectively; uses smooth transitions
• Uses varied, precise vocabulary
• Has occasional errors in mechanics (spelling, punctuation and capitalization) which do not detract from meaning

5 Fluent
• Writes single or multiple paragraphs with main idea and supporting detail: presents ideas logically, though some parts may not be fully developed
• Uses appropriate verb tense and a variety of grammatical and syntactical structures; errors in sentence structure do not detract from meaning; uses transitions
• Uses varied vocabulary appropriate for the purpose
• Has few errors in mechanics which do not detract from meaning

4 Expanding
• Organizes ideas in logical or sequential order with some supporting detail; begins to write a paragraph
• Experiments with a variety of verb tenses but does not use them consistently; subject/verb agreement errors; uses some compound and complex sentences; limited use of transitions
• Vocabulary is appropriate to purpose but sometimes awkward
• Uses punctuation, capitalization and mostly conventional spelling; errors sometimes interfere with meaning

3 Developing
• Writes sentences around an idea; some sequencing present, but may lack cohesion
• Writes in present tense and simple sentences; has difficulty with subject/verb agreement; run-on sentences are common; begins to use compound sentences

2 Beginning
• Begins to convey meaning through writing
• Writes predominately phrases and patterned or simple sentences
• Uses limited or repetitious vocabulary
• Uses temporary (phonetic) spelling

1 Emerging
• No evidence of idea development or organization
• Uses single words, pictures, and patterned phrases
• Copies from a model
• Little awareness of spelling, capitalization, or punctuation
Misinterpretations of holistic rating scales are common sources of imprecision in judging and reporting students’ performance (e.g., North, 1993). There are ways that reliability can be maximized in the use of rating scales, as will be discussed later in this chapter.
Analytic rating scales
Analytic rating scales differ from holistic scales in that they split up the specified criteria so that markers make a decision about the level of performance on each dimension (or criterion) and then come up with a final score, single composite or profile, by checking across the overall pattern of levels achieved. They are made up of the same number of separate scales as there are distinct components in the construct definition (Bachman and Palmer, 1996). Like holistic scales, analytic scales can be constructed from curriculum or theory-based definitions of language ability. To make decisions on scoring as unambiguous as possible, the weighting of each level for each dimension or criterion can be stipulated. Weighting can be equal, or different percentages can be allocated. Thus a pattern of weighting can be used to establish (if this is what is wanted) that the theoretical construct is strongly concerned with language use and less so with vocabulary and accuracy. Grammar, vocabulary, pronunciation and other features of language can be assessed through analytic scales. Purpura (2004, p. 254) suggests that analytic scoring rather than holistic scoring helps us to estimate the relative contribution of the grammatical (or other) knowledge to the assessment.
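The arithmetic of weighting can be illustrated with a short sketch. It is only an illustration under stated assumptions: the function name `weighted_composite`, the three dimension names, the four-level scale and the particular weights are all hypothetical and are not taken from any of the scales discussed in this chapter.

```python
from typing import Dict


def weighted_composite(levels: Dict[str, int],
                       weights: Dict[str, float],
                       max_level: int) -> float:
    """Combine per-dimension levels into one composite score out of 100."""
    if set(levels) != set(weights):
        raise ValueError("levels and weights must cover the same dimensions")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    # Each level is expressed as a fraction of the top level, then weighted.
    weighted_sum = sum(weights[d] * levels[d] / max_level for d in levels)
    return 100 * weighted_sum / total_weight


# Hypothetical ratings on a four-level scale (1 = lowest, 4 = highest).
levels = {"language use": 4, "vocabulary": 2, "accuracy": 3}

# Equal weighting versus a pattern that emphasises language use (assumed values).
equal = {"language use": 1, "vocabulary": 1, "accuracy": 1}
use_heavy = {"language use": 0.5, "vocabulary": 0.25, "accuracy": 0.25}

equal_score = weighted_composite(levels, equal, max_level=4)
use_heavy_score = weighted_composite(levels, use_heavy, max_level=4)
```

With these assumed figures the same set of ratings produces a higher composite when language use carries half the weight, because this child's strongest dimension counts for more. The weighting pattern, in other words, is itself a statement about the construct.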
In Table 8.4, I have transposed the same criteria from Figure 8.4 into separate categories for an analytic rating scale. In this scale, the dimensions are explicitly labelled as ‘Description of the story elements’; ‘The nature of the personal response’; ‘The accuracy of the story description’; and ‘Evaluation of the story’ and presented as separate scales within the analytic scale.
To use this scale, markers circle the descriptors that describe the child’s level on each of the criteria, and from there can make a ‘profile’ report on the child’s ability either by presenting the marked rating scales as a profile or by coming to a decision about the most prominent level. The profile method helps to show in which areas the child is weaker or stronger; thus information set out in this way is more immediately useful for diagnostic purposes than the single score that is obtained from a holistic scale.
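One way of picturing the step from a marked profile to ‘the most prominent level’ is sketched below. This is a sketch under assumptions rather than a prescribed procedure: the function name `most_prominent_levels` and the sample profile are hypothetical, and returning every tied level (instead of forcing a single answer) is a design choice made here so that the final decision stays with the marker.

```python
from collections import Counter
from typing import Dict, List

# Level labels in order from lowest to highest, as used in the literature response scale.
LEVELS = ["Needs improvement", "Satisfactory", "Good", "Outstanding"]


def most_prominent_levels(profile: Dict[str, str]) -> List[str]:
    """Return the level(s) circled most often across the dimensions."""
    counts = Counter(profile.values())
    top = max(counts.values())
    # Keep the scale order so that tied levels are listed from low to high.
    return [level for level in LEVELS if counts.get(level, 0) == top]


# Hypothetical marked profile for one child.
profile = {
    "Description of the story elements": "Good",
    "The nature of the personal response": "Good",
    "The accuracy of the story description": "Satisfactory",
    "Evaluation of the story": "Good",
}

prominent = most_prominent_levels(profile)
```

The profile itself still carries the diagnostic detail; the summary level only compresses it for reporting.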
Analytic criteria are considered to help teachers and assessors to be less subjective and less prone to variability in their marking than holistic cri- teria. They take more time to complete than global scales, but they provide clear guidance to markers on what they are looking for. It is harder for markers to ignore aspects of performance. The decisions that markers make are also very clearly set out for all to see. They give assessors the
Evaluating performance and progress 289
Table 8.4 Example of an analytic rating scale (adapted from Table 8.3)
Categories of Needs Satisfactory Good Outstanding criteria improvement
Description Describes few Describes Describes Describes of the story story elements some story most story most story elements through oral elements elements elements
or written through oral through oral (characters, language or or written or written settings, drawings language or language or beginning,
drawings drawings middle, and end of story) through oral or written language or drawings
The nature of Makes no Makes a Responds Responds the personal response or limited personally personally response a limited personal to the story to the story
personal response response to to the story the story
The accuracy Provides less Provides an Provides an Provides an of the story than accurate accurate accurate accurate description description description description and detailed
of the story of the story of the story description with some of the story details
Evaluation of States that Explains Analyses Develops the story he or she likes why he or something criteria for
or does not she likes or about the evaluating like the story does not like story (plot, the story
the story setting, character, illustrations)
terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9780511733093.009 Downloaded from https://www.cambridge.org/core. Wayne State Univ Libraries, on 15 Dec 2020 at 17:39:40, subject to the Cambridge Core
opportunity to acknowledge uneven development of sub-skills in individ- ual children’s performance. Analytic scales are therefore particularly useful for second language learners because uneven performance across different criteria is typically a feature of second language development; this helps to profile learners’ strengths and needs in the performance (Hamp-Lyons, 1991, cited in Weigle 2002, p. 120). In addition, analytic scales have been found to be more useful in rater training, as inexperi- enced raters can understand the criteria more easily when they are set out in separate scales rather than in holistic scales ( Weigle, 2002, p. 120).
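To make the idea of a profile concrete, the following minimal Python sketch records one level per criterion from Table 8.4 for a single child and lists the criteria on which the performance is strongest and weakest. The criterion labels and level names come from Table 8.4; the function name `profile`, the treatment of the levels as an ordered list and the sample ratings are illustrative assumptions, not part of the original scale.

```python
# Levels of Table 8.4, ordered from lowest to highest.
LEVELS = ["Needs improvement", "Satisfactory", "Good", "Outstanding"]

# Criteria of Table 8.4.
CRITERIA = [
    "Description of the story elements",
    "The nature of the personal response",
    "The accuracy of the story description",
    "Evaluation of the story",
]


def profile(ratings):
    """Return (strengths, needs) for one child's analytic ratings.

    `ratings` maps each criterion to one of LEVELS. Strengths are the
    criteria rated at the child's highest level, needs those rated at the
    lowest. If every criterion has the same level, the performance is
    even and both lists are empty.
    """
    missing = [c for c in CRITERIA if c not in ratings]
    if missing:
        raise ValueError(f"No rating given for: {missing}")
    ranks = {c: LEVELS.index(ratings[c]) for c in CRITERIA}
    high, low = max(ranks.values()), min(ranks.values())
    if high == low:
        return [], []
    strengths = [c for c in CRITERIA if ranks[c] == high]
    needs = [c for c in CRITERIA if ranks[c] == low]
    return strengths, needs


# Hypothetical ratings for one child (invented for illustration).
sample = {
    "Description of the story elements": "Good",
    "The nature of the personal response": "Outstanding",
    "The accuracy of the story description": "Good",
    "Evaluation of the story": "Needs improvement",
}

strengths, needs = profile(sample)
print("Strengths:", strengths)
print("Needs:", needs)
```

A holistic scale would collapse these four judgments into a single level; keeping them separate is what allows the uneven pattern to remain visible.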
A second example of an analytic rating scale, designed to score oral proficiency, is presented in Figure 8.5 to illustrate that not all rating scales are suitable for young learners. The scale describes performance in the lower levels in negative terms, as incorrect and weak: ‘Speech is so halting and fragmentary as to make conversation virtually impossible’; ‘Errors in grammar and word order so severe as to make speech virtually unintelligible’. For positive impact, criteria and descriptors for young learners are more suitable when they describe strengths and progress rather than errors. Without positive descriptions of growth, teachers may look for errors rather than instances of growth, and the resulting negative feedback to children and parents may lead to loss of self-esteem and motivation.
Student Oral Proficiency Rating

Student’s name; Grade; Language observed; School; City; State; Rated by; Date

DIRECTIONS: For each of the 5 categories below at the left, mark an “X” across the box that best describes the student’s abilities.

| | LEVEL 1 | LEVEL 2 | LEVEL 3 | LEVEL 4 | LEVEL 5 |
| --- | --- | --- | --- | --- | --- |
| Comprehension | Cannot understand even simple conversation. | Has great difficulty following what is said. Can comprehend only ‘social conversation’ spoken slowly and with frequent repetitions. | Understands most of what is said at slower-than-normal speed with repetitions. | Understands nearly everything at normal speed, although occasional repetition may be necessary. | Understands everyday conversation and normal classroom discussions without difficulty. |
| Fluency | Speech is so halting and fragmentary as to make conversation virtually impossible. | Usually hesitant: often forced into silence by language limitations. | Speech in everyday communication and classroom discussion is frequently disrupted by the student’s search for the correct manner of expression. | Speech in everyday communication and classroom discussion is generally fluent, with occasional lapses while the student searches for the correct manner of expression. | Speech in everyday conversation and in classroom discussion is fluent and effortless, approximating that of a native speaker. |
| Vocabulary | Vocabulary limitations are so extreme as to make conversation virtually impossible. | Misuse of words and very limited vocabulary make comprehension quite difficult. | Frequently uses the wrong words; conversation somewhat limited because of inadequate vocabulary. | Occasionally uses inappropriate terms or must rephrase ideas because of inadequate vocabulary. | Use of vocabulary and idioms approximates that of a native speaker. |
| Pronunciation | Pronunciation problems so severe as to make speech virtually unintelligible. | Very hard to understand because of pronunciation problems. Must frequently repeat in order to be understood. | Pronunciation problems necessitate concentration on the part of the listener and occasionally lead to misunderstanding. | Always intelligible, though one is conscious of a definite accent and occasional inappropriate intonation patterns. | Pronunciation and intonation approximate a native speaker’s. |
| Grammar | Errors in grammar and word order so severe as to make speech virtually unintelligible. | Grammar and word order errors make comprehension difficult. Must often rephrase or restrict what is said to basic patterns. | Makes frequent errors of grammar and word order which occasionally obscure meaning. | Occasionally makes grammatical or word order errors which do not obscure meaning. | Grammatical usage and word order approximate a native speaker’s. |

Figure 8.5 An example of a scale more suitable for older learners. Student Oral Proficiency Rating (adapted from the Student Oral Language Observation Matrix (SOLOM)) developed by the San Jose (California) United School District (Thompson, 1997, p. 176).

Holistic and analytic rating scales can be prepared and used effectively as scoring rubrics for learner language. They can be used for formative and summative purposes. They can be written to reflect the construct being assessed, and to reflect the young learner curriculum. Clarity can be achieved for markers in the descriptions, though samples of work and marker training are needed to maximize reliability and thus fairness. Once they are written, they are practical to use; however, their preparation can take time, and collaboration with other teachers and experts is desirable both in their preparation and in their use for marking. If judiciously shared with children, they can give guidance on what is required. Teachers gain valuable professional understandings through such collaboration and marker training. The use of rating scales, backed up with moderation, tends to be accepted by parents and administrators, and rating scales are commonly used in high-stakes testing: ‘ratings provide greater opportunity for assessing the effectiveness of the test task, as well as its impact on test takers. For such reasons, we believe that ratings are well worth their relatively high cost in human resources’ (Bachman and Palmer, 1996, p. 200). Rating scales can also be used as reporting scales, that is, they can become reference points for the reporting of achievement to students, parents and others. If they are used as reporting scales, then they need to be evaluated against the user-oriented criteria for reporting scales in Table 8.1.
The following section gives guidance on developing scoring rubrics, and on maximizing reliability in their use.
How scoring rubrics are constructed
In order to prepare scoring rubrics so that they present the most appropriate criteria and guidelines to arrive at a mark, classroom teachers carrying out low-stakes assessment can construct their own rubrics according to their own knowledge and expectations of children’s performance, and according to the curriculum requirements. The degree of work required in developing scoring rubrics depends on the purpose for assessment, and in particular how high the stakes are. The following steps can be taken in both low- and high-stakes assessment, though in low-stakes assessment the degree to which the steps are followed will depend on time and collegial support available.
• Determine what the characteristics of quality performance are. (Refer to earlier sections of Chapters 6 and 7 in this book for theoretical frameworks, genres and contributing skills to inform what these characteristics might be in oral language, reading and writing.)
• Gather sample rubrics that were developed for a similar purpose, for children of a similar age in a comparable learning context.
• Gather samples of children’s work that demonstrate the range of performance from ineffective to very effective.
• Discuss with others the characteristics of these models that distinguish the effective ones from the ineffective ones.
• Write criteria (in levels, if required) for the important characteristics.
• Gather another set of samples of students’ work.
• Try out the rubrics to see if they help you make accurate judgments about children’s performance (and that you agree with others on these judgments).
• Revise your criteria, if necessary.
• Try it again until the score captures the ‘quality’ of the work.

(adapted from Herman et al., 1992, p. 75)
Careful checks like this are necessary if the rubrics are to be considered valid and reliable. Decisions on the nature of the criteria, on the degree of detail in the criteria and on the number of levels are an integral part of this process. In high-stakes situations, it is preferable to base scoring rubrics on empirical evidence, rather than simply on agreements about what criteria should be used, and on what ‘looks right’. Empirical evidence should be based on actual student performance and may include analyses of actual samples of work, and also analyses of teachers’ comments about what constitutes performance at different levels (Greatorex, 2003, p. 127).
How to maximize reliability in scoring
Reliability refers to the extent to which the child would get the same results if another teacher or assessor were to assess their work, or if they were to assess it in the same way again another day, or if they were assessed through different tasks and with different rubrics. Reliability is an essential quality of ‘useful’ assessment, but the level of reliability needed is dependent on the importance of decisions to be made. The amount of time and resources allocated to achieving reliable ratings will be a function of the importance of the decisions to be made. For very low-stakes decisions (e.g., diagnosis), fewer resources need be allocated, and we can settle for lower levels of reliability; for medium to high-stakes decisions (e.g., progress, grades), we need to allocate more resources to ensure reliability.
The following example illustrates the difficulties some teachers have using rating scales without training. At the beginning of a professional development workshop (that is, before teachers had been trained in the use of the scoring rubrics), elementary teachers of French in Australia were asked by their advisory teachers how they would rate an upper elementary student’s writing, reproduced in Figure 8.6. The teachers were given the rubrics and asked ‘Can you indicate the level qualities of this piece of work?’ (The English translation was not provided for the teachers; it is provided for this example only.)
Table 8.5 shows the ratings and comments from the teachers in the professional development activity. Teachers wrote the notes for themselves, to help them with their discussion on the ratings given and why (as in a moderation session); however, they were kindly submitted and collected by the advisory teachers and are reproduced here to show how different teachers judged the piece of work.
This example of teachers’ assessment decisions shows how teachers can make widely different judgments about the qualities of a piece of work, even with the same set of scoring rubrics. It is evident that the constructs that are to be assessed in the piece of work are not at all clear to teachers; they are picking up on different qualities described in different levels, and placing the piece of work at different levels. Teacher 5 has placed more weight on accuracy and rated the piece of work as a low 3.5. Teacher 1 could see many positive features in the text, including cohesiveness and spontaneity, and rated it as a high Level 6. Teacher 2 was looking for colloquial expressions and register change, following the criteria in Level 6.
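The disagreement among the five teachers in Table 8.5 can be summarized numerically. The Python sketch below takes the marks reported in Table 8.5 and returns the lowest mark, the highest mark and the range between them. Storing each mark as a (lowest, highest) pair, so that Teacher 3’s ‘between 3.5 and 4.5’ can be kept as an interval, is an assumption made for illustration, and the names `marks` and `spread` are hypothetical.

```python
# Marks from Table 8.5, stored as (lowest, highest) pairs so that
# Teacher 3's 'between 3.5 and 4.5' can be kept as an interval (an
# assumption made for illustration). Single marks have lowest == highest.
marks = {
    "Teacher 1": (6.0, 6.0),
    "Teacher 4": (5.5, 5.5),
    "Teacher 2": (5.0, 5.0),
    "Teacher 3": (3.5, 4.5),
    "Teacher 5": (3.5, 3.5),
}


def spread(marks):
    """Return (lowest mark, highest mark, range) across all markers."""
    lowest = min(low for low, _ in marks.values())
    highest = max(high for _, high in marks.values())
    return lowest, highest, highest - lowest


lowest, highest, mark_range = spread(marks)
print(f"Lowest mark: {lowest}, highest mark: {highest}, range: {mark_range}")
```

On these figures the marks for a single piece of work span 2.5 levels, from 3.5 to 6.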
Researchers have found that assessment criteria are commonly interpreted differently by markers (North, 1993; Gipps, 1994; Chalhoub-Deville, 1995). Markers’ ratings can be different for a number of reasons:

• the scoring rubrics may not be valid descriptions
• the scoring rubrics may not be clear enough
• markers may not be familiar enough with the scoring rubrics (and may need more training on how to understand the constructs within the guidelines)
• markers are using their own conceptualization of the criteria, rather than interpreting the written criteria
• markers are bringing unconscious expectations and subjective preferences regarding the relative importance of different criteria
• markers need samples of work to assist them to make a decision
• the scoring rubrics cannot provide sufficiently useful criteria to make a judgment on one piece of work (they are not, after all, intended for this purpose)
• teachers (in formative assessment situations) assume different degrees of support for the learners

Famous people: Kylie Minogue

Elle s’apple Kylie Minogue. Elle est née le mai 1968. Elle est allee Comberville Lycee, Melbourne. Elle aime jouer la comedie. Elle n’aime pas le sport. Elle a une soeur qui s’apple Danii et elle a un frere qui s’apple Brendan. En 1979 gagna le role en The hendersons. Elle aussi Gagna le role de Charlene. En Avril Kylie gagna le TV Logie pour la meilleure actrice pour le role de Charlene. En aout 1986 elle chantait ‘Locomotion’ qui gagn numero uno en Australia a Londre au mois de Septembre 1981 recordai, ‘I should be so lucky’. Elle a dix-neuf ans. Kylie devait choisir entre chanter et jouer. Elle chantait. Elle est chante devant la reine a la Royal Command Performance.

[Grammatical errors cannot be included in the translation]. Her name is Kylie Minogue. She was born in May 1968. She went to Comberville School, Melbourne. She likes to act comedy. She doesn’t like sport. She has a sister who is called Danii and she has a brother who is called Brendan. In 1979 she won a role in The Hendersons. She also won the role of Charlene. In April Kylie won a TV Logie for the best actress for the role of Charlene. In August 1986 she sang ‘Locomotion’ which won number one in Australia and London. In September 1981 she recorded ‘I should be so lucky’. She is 19 years old. Kylie had to choose between singing and acting. She sang. She has sung in front of the Queen at the Royal Command Performance.

Figure 8.6 An upper elementary student’s writing in French, assessed independently by a group of teachers in Table 8.5 (Dodd and Butler, 2002).
Table 8.5 Teachers’ ratings of the piece of written work in Figure 8.6 (Dodd and Butler, 2002)

| Teacher | Mark assigned | Teachers’ comments |
| --- | --- | --- |
| Teacher 1 | 6 | ‘The student is writing a personal recount/text of Kylie Minogue whereby his/her ideas are developed and presented logically. There are a few errors with grammar, spelling etc. but there is evidence of being able to write a lengthy text with a fair bit of cohesiveness and spontaneity. Use of formal/informal language, frequency of expression, common colloquial expression.’ |
| Teacher 4 | 5.5 | ‘Has tried to generate original language (emerging) with some complex elements. Has used model.’ |
| Teacher 2 | 5 | ‘Use of imperfect/perfect. No problem understanding despite grammatical errors. No really colloquial expressions to warrant level 6. No difference in register but no opportunity to show. Could be “simple cohesive text” re level 4 but I think it’s beyond.’ |
| Teacher 3 | Between 3.5 and 4.5 | ‘The student has shown manipulation showing structure and linguistic features. The text constructed is simple and cohesive. It has been taken from a model. Prepositions/tense, recount, sentences/to be/to have.’ |
| Teacher 5 | 3.5 | ‘Linked sentences on a familiar topic, but hasn’t manipulated sentences accurately.’ |

If teachers and assessors are to mark in a consistent manner, it is essential that they agree on the meaning and application of the criteria (William, 1996, cited in Greatorex, 2003, p. 130). The following procedures can help to maximize reliability: (1) the careful development of the scoring rubrics; (2) working with markers to reach consensus or agreement on the meaning of the rubrics; and (3) training markers.
1. Careful development of scoring rubrics
Reliability of marking is more likely when the scoring rubrics are carefully developed, that is, when assessors follow the steps described above and check their rubrics against the characteristics of effective scoring rubrics listed above.
2. Working with markers to reach consensus
When markers have the opportunity to work together to reach a consensus on their scores, this also helps to raise reliability. In classroom assessment, for example, a teacher may ask another teacher to check his marking. Two teachers may mark work (or chosen samples of work) independently and check their findings together. When markers have opportunities to discuss criteria and levels together, and have access to each other’s opinions and experience, especially with reference to work samples, they can come to a shared understanding of the scoring rubrics and the appropriate final scores. Group sessions when markers come together to share their understanding of the scoring rubrics and to reach consensus on final scores are called moderation sessions. A further marker can be asked to assess the same sample of work if there is still a discrepancy, and further moderation sessions may be organized to ensure that markers do not gradually slip away from the agreed understandings and expectations. These same kinds of procedures are used in formal assessment situations and are supplemented with statistical moderation procedures (that is, statistical procedures are used to compare and adjust the patterns of markers’ scores) where high stakes are involved and resources for this are available.
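As a rough illustration of what comparing the patterns of markers’ scores can involve, the Python sketch below computes three simple agreement figures for two markers who have independently scored the same set of scripts. The scores are invented for illustration, and the three statistics (exact agreement, agreement within half a level, and mean difference) are descriptive checks chosen here as an assumption; the text does not specify which statistical moderation procedures are used in practice.

```python
from statistics import mean

# Hypothetical levels awarded by two markers to the same eight scripts.
marker_a = [3.5, 4.0, 5.0, 5.5, 4.5, 3.0, 6.0, 4.0]
marker_b = [4.0, 4.0, 5.5, 6.0, 4.5, 3.5, 6.0, 5.0]


def compare_markers(a, b, tolerance=0.5):
    """Summarize how closely two markers' scores agree.

    Returns the proportion of scripts given identical scores, the
    proportion scored within `tolerance` of each other, and the mean of
    (b - a). A positive mean suggests that marker b tends to award
    higher levels than marker a.
    """
    if len(a) != len(b) or not a:
        raise ValueError("Both markers must score the same non-empty set of scripts")
    diffs = [y - x for x, y in zip(a, b)]
    exact = sum(d == 0 for d in diffs) / len(diffs)
    close = sum(abs(d) <= tolerance for d in diffs) / len(diffs)
    return exact, close, mean(diffs)


exact, close, bias = compare_markers(marker_a, marker_b)
print(f"Exact agreement: {exact:.0%}")
print(f"Within half a level: {close:.0%}")
print(f"Mean difference (b - a): {bias:+.2f}")
```

A mean difference that is consistently positive or negative would suggest that one marker is systematically more lenient or more severe than the other, which is the kind of pattern a moderation session could take up.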
3. Training markers
Markers can be trained through the same kinds of moderation procedures described above. In school assessment, training can occur over time through professional development activities. Markers study the rubrics, then independently give sample texts a score with reference to the scoring rubrics. Some markers are invited to tell others their score, and why they arrived at that score. Other markers do the same, in particular those who disagree, until a consensus is reached. Markers may take the samples with agreed scores away with them to refer to when they are marking.
Hasselgren (2000) reports that in a national assessment project in Norwegian elementary schools, teachers have opportunities for training in a variety of centrally supported classroom assessment and professional development procedures. Teachers are provided with rubrics with levels of performance, on which they are encouraged to write comments. Pupils have their own evaluation sheets. Professional development activities are provided so that teachers are familiar with how to use the material, and how to interpret results. In the project, professional development was found to be both popular and effective, as the following extract from Hasselgren’s account outlines.
Scoring instruments are provided both for pupils and teachers. Pupils’ self-assessment forms are filled in at the end of each subtest. Pupils use a four-point scale (‘yes, mostly, a bit, no’) to rate various aspects of their own performance, salient to the particular macro-skill being tested, as well as their overall performance. They are also asked to rate the material, and to say what they have learnt, using English as far as possible.
The teachers have a range of scoring forms. For the first three subtests, scores are entered on a score sheet. However, teachers are encouraged to add comments, which may be influenced by the pupil’s own assessment. For the speaking and writing tests, teachers fill in a profile form for each pupil, choosing one of three level descriptors for each of five different aspects, roughly corresponding to the components of communicative language ability . . . In the case of speaking, observation forms for classroom use are provided . . . There is some correspondence between the questions on pupils’ self-assessment forms and those in the teachers’ materials.
The material is photocopiable and the profile, self-assessment and observation forms are intended to be used on a regular basis in classroom activity. It is anticipated that this ongoing, more comprehensive and multiple perspective assessment will yield a more reliable profile of a pupil’s ability than a one-off test battery could ever achieve, besides providing a means of tracking and documenting progress. Moreover, it gives training to both pupils and teachers in the areas of assessment, providing them with a metalanguage for describing their language abilities.
Teachers are instructed not only in how to use the material, but also on how to interpret results and, equally importantly, on how to act on them to develop their pupils’ skills. It is emphasized that scores and profiles should only be regarded as an indication of ability, which should be pursued further. The assessment results should always be interpreted alongside the pupils’ self-assessment comments.
While the handbook is intended to be self-instructing, it is recognized that teachers benefit from active training in the use of the test material and accompanying forms. For this reason, county authorities have been invited to arrange in-service training courses in assessment using the EVA material. This activity is currently ongoing and has proved to be popular among authorities and teachers alike.
(Hasselgren, 2000, p. 266)
In teacher assessment situations, professional development activities such as those created in the Norwegian project make reliability more likely. However, as is recognized in the project, the appropriate construction of scoring rubrics remains a central condition for reliability.
Standards
As I described in Chapter 1, standards are descriptions of curriculum outcomes, usually described in stages of progress; they give descriptions of how much, or at what level, students need to perform to demonstrate achievement of the content standards, or descriptions of what students should know and be able to do. There is a growing body of literature on standards for language learners (Brindley, 1998; Clair, Adger, Short and Millen, 1998; McKay, 2000) to which readers can refer. The following section attempts to provide some sense, in summary, of the issues in the development, use and evaluation of standards for young second language learners.
The development and use of standards
There are different ways that standards are presented (see, for example, the three standards presented below). Some standards are lists of bullet points of curriculum-related outcomes, some are set out as staged descriptions of developing ability. Some are organized around goal statements (‘To use English to achieve academically in all content areas: Students will use appropriate learning strategies to construct and apply academic knowledge’ (TESOL, 1997, p. 91)), and others around language use activities such as listening, speaking, reading and viewing, and writing. Some stand alone, others are accompanied by exemplar tasks, and others by vignettes and instructional sequences. The differences depend on a variety of factors such as their purpose and philosophical framework.
Standards are used for a variety of purposes, including the following:
(1) to establish expected standards of achievement;
(2) to provide system-wide reference points to assist teachers in assessing individual progress;
(3) to provide a common framework for curriculum development;
(4) to provide more comprehensive information for reporting to interested parties outside the classroom, such as parents, employers, and educational authorities;
(5) to provide a basis for identifying needs and targeting resource allocation;
(6) to clarify the kinds of performance that lead to academic success;
(7) to provide a resource for teacher professional development.
(adapted from Brindley, 1998, pp. 49–50)
One advantage of standards, amongst others, is that assessments can be closely aligned with instruction. Teachers and assessors can design assessments with reference to the outcomes statements and can compare students’ performance with the statements to evaluate that performance. The standards provide a common reference point for teachers, assessors and administrators, facilitating communication about learning and progress. Standards also provide a rational and objective basis for administrators’ decisions about programme needs and resource allocation (Brindley, 1998, p. 52).
There are different ways that standards are developed, depending on their purpose. Some are developed a priori, meaning that they are designed without reference to empirical data, but more with reference to the curriculum requirements and what is generally expected of students as they progress through their stages of learning. Some standards are designed a posteriori, that is, they are designed following examination of the outcomes that students actually achieve. There are examples of both in the three examples of standards below. The advantage of the former is that the outcomes are usually closely aligned with the curriculum, though they may be developed primarily through subjective and experience-based judgments. The advantage of a posteriori development is that there is less likely to be a mismatch between developmental pathways and the descriptions contained in the standards since empirical and psychometric methods are used. A disadvantage, however, is that there may not be a close match in the stages and descriptors with the published curriculum (Sadler, 1987, p. 3). In some cases, a combination of approaches is used.
How is students’ performance assessed and reported against standards? In many situations around the world (e.g., the United States), standardized tests are used; that is, large-scale, statistically normed tests, devised to assess students’ performance against the official standards, are administered to cohorts of students across a state or nation. In other situations (e.g., England and Wales), standard assessment tasks are administered by teachers and marked by external assessors at certain assigned stages of schooling. In other countries around the world (e.g., Australia), classroom teachers are expected to observe and record progress against the standards over time and, through their reporting, pass on to parents and administrators a report on children’s progress against the state standards. (However, in Australia, there has been a gradual introduction of other national standardized tests designed for accountability purposes.)
The impact of standards on young second language learners
Standards can have a positive impact in that they can give administrators and teachers a common understanding about what should be taught; parents can become clearer about what should be taught and learned, and they can find out how their child is progressing in relation to other children and the expected progress of children of this age. However, there can also be a negative impact when parents begin to believe their child is ‘falling behind’, when in fact the issue may be a developmental ‘lag’, common with young children, who often ‘catch up’ at a later stage of development. Young second language learners may also be behind because they are being assessed against unrealistic and invalid first language learner expectations. Standards may have other negative impacts, especially when they are tied to external tests (McKay, 2004). Teachers may become obliged to teach to the standards, instead of to the developmental needs of children; the curriculum may become assessment-driven; assessment requirements take up time that could be devoted to teaching. Teachers can become focused on the need to move children up the scale, rather than to meet current needs (McKay, forthcoming). Lastly, teachers can be judged accountable based on the results of tests tied to standards. This accountability measure is believed by some administrators to cause teachers to teach ‘even harder’ to advance the children along the scale. However, it can cause stress amongst teachers and a loss of individualized teaching appropriate to children’s needs. Children come to teachers from a range of different starting points and developmental levels, and they develop at different rates and have different needs.
There are continuing issues to be resolved concerning tensions in the use of standards and the impact of standards and accompanying testing regimes on learners. Research into the impact of standards was discussed in Chapter 3; there is a continuing need for research in the area, particularly in relation to the impact of standards and their accompanying external tests on young learners.
Three examples of performance standards
This section briefly describes three performance standards and evaluates them using the framework for evaluating scoring rubrics and reporting scales in Table 8.1 above.
Example 1: The Illinois Foreign Language Learning Standards
The foreign language learning standards of the Illinois State Board of Education are a good example of standards mandated by a large educational authority. These standards set out the expectations of content and performance of learners being taught through the state curriculum. As is common in an education department’s standards documents, these standards combine both the content and performance standards within one document. The statement of what should be learned also becomes the statement of criteria (the descriptors) that guide teachers’ and assessors’ assessment of what has been learned. Standards are provided under three goals across five stages. Table 8.6 shows the Learning Standards for foreign language learning under the first goal, Goal 28 Communication. A general learning standard is listed on the left, and then more specific standards are listed (with connection by numbers to content standards) across the page in five stages of progress.
Table 8.6 Illinois Foreign Language Learning Standards: Standards under Goal 28 Communication (Illinois State Board of Education, 2003)
As a result of their schooling students will be able to:
Stages: Stage 1 Beginning; Stage 2 Beginning intermediate; Stage 3 Intermediate; Stage 4 Advanced intermediate; Stage 5 Advanced

Learning standard A. Understand oral communication in the target language
Stage 1
28.A.1a Recognize basic language patterns (e.g., forms of address, questions, case)
28.A.1b Respond appropriately to single commands in the target language
Stage 2
28.A.2a Comprehend illustrated stories, audiovisual programs or websites
28.A.2b Follow instructions in the target language, given one step at a time, for a wide range of activities
Stage 3
28.A.3a Comprehend main messages of simple oral and audio presentations with assistance from resources (e.g., glossaries, guided questions, outlines)
28.A.3b Follow instructions in the target language as given in multistep segments for assignment and activities in and out of the classroom

Learning standard B. Interact in the target language in various settings
Stage 1
28.B.1a Respond to and ask simple questions with prompts
28.B.1b Imitate pronunciation, intonation and inflection including sounds unique in the target language
Stage 2
28.B.2a Pose questions spontaneously in structured situations
28.B.2b Produce language using proper pronunciation, intonation and inflection
28.B.2c Comprehend gestures and body language often used in everyday interactions in the target language
Stage 3
28.B.3a Respond to open-ended questions and initiate communication in various situations
28.B.3b Produce language with improved pronunciation, intonation and inflection
28.B.3c Use appropriate non-verbal cues common in areas where the target language is spoken

Learning standard C. Understand written passages in the target language
Stage 1
28.C.1a Recognize the written form of familiar spoken language and predict meaning of key words in a simple story, poem or song.
28.C.1b Infer meaning of cognates from context
Stage 2
28.C.2a Comprehend directions, read simple passages, infer meaning of cognates and recognize loan words
28.C.2b Decode new vocabulary using contextual clues and drawing on words and phrases from prior lessons
Stage 3
28.C.3a Comprehend the main message of a variety of written materials with the help of resources (e.g., dictionary, thesaurus, software, Internet, e-mail) to expand vocabulary
28.C.3b Compare word use, phrasing and sentence structures of the target language with those used in one or more other languages

Learning standard D. Use the target language to present information, concepts and ideas for a variety of purposes to different audiences
Stage 1
28.D.1a Copy/write words, phrases and simple sentences
28.D.1b Describe people, activities and objects from school and home
Stage 2
28.D.2a Write on familiar topics using appropriate grammar, punctuation and capitalization
28.D.2b Present a simple written or oral report on familiar topics
28.D.2c Present an original production (e.g., TV commercials, ads, skits, songs) using known vocabulary and grammatical structures
Stage 3
28.D.3a Write compositions and reports with a specific focus, supporting details, logical sequence and conclusion
28.D.3b Present findings from research on unfamiliar topics (e.g., the Roman army, the French chateaux, origins of chocolate)
28.D.3c Present a simple, original poem or story based on a model
The Illinois standards were written a priori, as many education department standards are written, that is, the writers have set out hoped-for outcomes, rather than nominate outcomes based on research data of learner performance. The layout of these standards reflects their administrative purpose – they are designed to collect system-level data on students’ progress. However, in order to clarify the meaning of the standards for teachers, the state has provided other material such as sample classroom assessment tasks and performance descriptors. These are for use at the local level, on a voluntary basis, to support classroom teaching and assessment. The first three stages in these standards reflect to a limited extent the types of abilities pertinent to young learner foreign language learning, for example ‘Copy/write words, phrases and simple sentences’, but probably because of the need to cater for all age groups in the early stages, the characteristics of young learners are not prominent in these descriptions. Classroom assessments and performance descriptors may elaborate young learner performance criteria beyond what is shown in this broad statement of the standards. However, a separate set of statements for young language learners, reflecting the special nature of their language learning, and language learning outcomes, would ensure that the standards reflected the construct of assessment. Those devising tests would be better guided towards testing appropriate for young learners. Those monitoring children’s progress in the classroom would have less ambiguity to deal with in the descriptions. Nevertheless, as I discussed above, the minimal nature of the descriptors would be likely to continue to present a challenge for teachers involved in day-to-day assessment.
Questions regarding the positive impact on the learning of young learners of these standards (as they stand without specific statements for young learners), and whether teachers’ professional understandings and involvement in the learning process are enhanced through the use of these standards, with their accompanying materials, can best be answered through research.
For users, parents, other teachers and administrators, it is likely that the standards present meaningful descriptions of progress. The descriptors and levels more than likely provide valuable information for users. A report would be generated from these standards that will give parents an understanding of their child’s progress through the required curriculum, and (even though performance assessment avoids normative judgments) an idea of their child’s progress in relation to that of others of their age group. A report can also provide administrators with an overview of the performance of cohorts of students, individual schools and individual teachers if needed. The impact of the standards on users is likely to be positive, since there is possibly a greater understanding of the nature of the curriculum.
Example 2: The Common European Framework of Reference
The Common European Framework of Reference ‘provides a common basis for the elaboration of language syllabuses, curriculum guidelines, examinations, textbooks etc. across Europe’. It defines levels of proficiency ‘which allow learners’ progress to be measured at each stage of learning and on a life-long basis’ (Council of Europe, 2001). The purpose of this framework is to bring together the different language learning programmes in Europe by setting out a common understanding of pathways and levels. There is a global scale (Figure 8.7), and illustrative scales for different skills, an example of which is presented in Figure 8.8. These scales can be used to organize and monitor progress across groups and programmes, as well as to check individual learner progress. For our purposes, only the earlier levels of proficiency are presented in Figures 8.7 and 8.8, since the upper levels move into the type of language performance expected of adults. The levels were developed mainly a posteriori, in that the development of the final descriptors was based on empirical analysis of teachers’ perceptions of development (see North, 1995), and on theories of language competence, made explicit throughout the Common Reference Levels publication (Council of Europe, 2001).
Many illustrative scales are also provided in the document. Illustrative scales give more detail about levels of progress in specific skills. Figure 8.8 shows the illustrative scale for ‘Interviewing and being interviewed’.
Those who constructed the Framework of Reference did not write a specific scale for young learners. Many of the descriptors in the ‘Common Reference Levels: global scale’ are inappropriate for younger learners beyond B1 Independent User as the topics of discussion become more abstract, specialized and technical, and texts become longer and more detailed. The first three levels are suitable for use with young foreign language learners, though they are skeletal, designed to describe development of beginning learners of all ages. The descriptor in Level C2 of the Common Reference Scales expects that the language user ‘Can keep up his/her side of the dialogue extremely well, structuring the talk and interacting authoritatively with complete fluency as interviewer or interviewee.’ In the illustrative scale in Figure 8.8, learners at B1 are able to ‘use a prepared questionnaire to carry out a structured interview’. Young learners might therefore be restricted to the lower levels of the Common Framework Scales, unless ways of describing advanced proficiency can be found that do not require advanced cognitive and social skills. Children might connect back into the same or further levels (B2 and onwards) on the original scale as they advance in cognitive maturity and in their foreign language learning.

Proficient user
C2
C1
Independent user
B2
B1 Can understand the main points of clear standard input on familiar matters regularly encountered in work, school, leisure etc. Can deal with most situations likely to arise whilst travelling in an area where the language is spoken. Can produce simple connected text on topics which are familiar or of personal interest. Can describe experiences and events, dreams, hopes and ambitions and briefly give reasons and explanations for opinions and plans.
Basic user
A2 Can understand sentences and frequently used expressions related to areas of most immediate relevance (e.g., very basic personal and family information, shopping, local geography, employment). Can communicate in simple and routine tasks requiring a simple and direct exchange of information on familiar and routine matters. Can describe in simple terms aspects of his/her background, immediate environment and matters in areas of immediate need.
A1 Can understand and use familiar everyday expressions and very basic phrases aimed at the satisfaction of needs of a concrete type. Can introduce him/herself and others and can ask and answer questions about personal details such as where he/she lives, people he/she knows and things he/she has. Can interact in a simple way provided the other person talks slowly and clearly and is prepared to help.
Figure 8.7 First three levels of Common Reference Levels: global scale (Council of Europe, 2001, p. 24).
Therefore work is required by elementary language educators to adapt the Common Reference levels to young foreign language learners. Hasselgren’s (2005) scales for young learners, developed within the framework of the Common Reference Scales, are of interest in this regard. The value of a common framework such as the Common Reference Scales is very great, and educators of young learners have a common reference point from which to build up a set of standards for young learners that can be adapted for different languages, and that can later connect into levels for older learners.
The Common Framework Scales reflect their stated purpose, providing a basis for the elaboration of textbooks and programmes, and for the monitoring of learners’ progress at each stage of learning, though this purpose is not, in reality, attained for young learners without further research and interpretation. The construct to be assessed, language proficiency, is well outlined in theoretical terms, and the empirical procedures used help to ensure that the scales reflect this construct. However, the construct of language proficiency for younger learners is not sufficiently addressed; young learners are not mentioned anywhere in the document and it therefore must be assumed that they are not included in the learner group. The lack of attention to young learners means that young learners’ needs may not be met by teachers who use the scales to guide their assessment, and teachers’ professional understandings about young learners’ language learning may not be enhanced. Similarly, users involved with young learners (young learners themselves, parents, teachers, curriculum developers, administrators) may find that these scales do not meet their needs unless there is further research and development targeted towards the language learning of young learners.

B1
Can provide concrete information required in an interview/consultation (e.g., describe the symptoms to a doctor) but does so with limited precision.
Can carry out a prepared interview, checking and confirming information, though he/she may occasionally have to ask for repetition if the other person’s response is rapid or extended.
Can take some initiatives in an interview/consultation (e.g., to bring up a new subject) but is very dependent on interviewer in the interaction.
Can use a prepared questionnaire to carry out a structured interview, with some spontaneous follow up questions.
A2
Can make him/herself understood in an interview and communicate ideas and information on familiar topics, provided he/she can ask for clarification occasionally, and is given some help to express what he/she wants to.
Can answer simple questions and respond to simple statements in an interview.
A1
Can reply in an interview to simple direct questions spoken very slowly and clearly in direct non-idiomatic speech about personal details.
Figure 8.8 Example of the levels A1–B1 of an illustrative scale: interviewing and being interviewed. Common European Framework of Reference (Council of Europe, 2001, p. 82).
In summary, the Common European Framework has provided language educators in Europe with an important theoretically and empirically based framework, but young learners would have benefited most if their particular learner characteristics and needs had been included in the original project.
Example 3: The Australian (NLLIA) ESL Bandscales
The Australian ESL Bandscales (McKay, Hudson, and Sapuppo, 1994), known in Australia as the NLLIA (National Languages and Literacy Institute of Australia) ESL Bandscales, are an example of performance standards. Their purpose is to provide ESL teachers with a common reference tool for assessment and planning for ESL learners in ESL and mainstream programmes. They are written in the form of holistic rating scales in Listening, Speaking, Reading and Writing. The ESL Bandscales reflect the multiple entry points of ESL learners into Australian schools (see Figure 8.9). There are three Bandscales reflecting three broad age groups: junior primary (approximately ages 5–7), middle/upper primary (approximately ages 8–11) and secondary (approximately ages 12–18). The scales describe progress from beginning to advanced English language development in the mainstream context for each of these broad age groups. There are separate junior primary, middle/upper primary and secondary scales, giving a strong profile to young learners’ language learning pathways. A young learner who is near-equivalent to native-speaking children of his/her own age, in the upper levels of the junior primary and middle/upper primary Bandscales, is therefore not ‘made invisible’ as they move closer to but have not yet attained native speaker-like language abilities.
In Figure 8.10, Levels 4 and 5 of Junior Primary Listening are reproduced as examples of two levels for young learners in the ESL Bandscales. The first bolded statement is an overall statement designed for first reference. The remainder of the descriptors follow a theoretically based structure outlined in the introduction to the materials (McKay, Hudson and Sapuppo, 1994, p. A27).
The purpose of the NLLIA ESL Bandscales was primarily to improve teachers’ professional understandings about second language learning, and therefore to inform their teaching and assessment decisions as they work with mainstream teachers in Australian schools. They were written with this clear purpose in mind, and therefore are not written in bullet-point lists, include more ‘messages’ for teachers, and describe, in holistic descriptions, the typical language behaviour of children moving through the levels. They include reminders about the characteristics of second language acquisition of young learners (e.g., of the possible presence of the silent period), and reminders about the role of the first language in second language learning. They reflect the cognitive demand and the maturity of each broad age group, and also the types of tasks that young learners are expected to carry out in their mainstream classrooms. Because of this clarity of purpose, the ESL Bandscales can be said to be appropriate for young learners, to reflect their purpose, and their construct (that is, the language learning of young learners in the mainstream). They can be said to promote a reasonably fair assessment for young learners.

[Diagram labels: ESL Bandscales; ESL junior primary bandscales, levels 1 to 8; ESL middle/upper primary bandscales, levels 1 to 8; ESL secondary bandscales, levels 1 to 8; axis label: Mainstream Language & Literacy Development]
Figure 8.9 The three levels of the Australian (NLLIA) ESL Bandscales (McKay, Hudson and Sapuppo, 1994, p. A17) accommodating multiple entry points.

Listening: Level 4
Are able to comprehend social English in familiar contexts (e.g., in general school contexts: in classroom interaction around activities, in playground interactions, on excursions etc.) with ease, with only occasional help given by the interlocutor. Are able to follow instructions within a classroom learning activity if explained and presented clearly (i.e. with clear steps, modelling of the task, logical sequencing of steps) though will often rely on further repetition of instructions on a one-to-one or small group basis.
Require intensive concentration to comprehend fully. Are likely to lose comprehension with high background noise present (e.g., other children talking). May use strategies which give the impression that full comprehension has taken place (e.g., nodding; smiling; copying actions of others; silence) which can be misleading in learning activities.
Need time for processing of language experienced (e.g., before having to answer a question; during teacher talk; during class discussion). Have short concentration span if topic of lesson is unfamiliar.
Will lack precision in understanding, e.g. will miss many details of the language they hear, e.g. may not understand a wide range of prepositions, e.g. between, below, beneath and will have difficulty with complex structures, e.g. although . . . how often . . . etc. Are restricted by a limited vocabulary.
Listening: Level 5
Are able to comprehend English in a range of social contexts pertinent to their age level. Are less dependent on extra help from the interlocutor, and have little need to ask for repetition or reformulation, especially if the topic is familiar. Will comprehend main points and most detail in learning activities on familiar topics if activities are language-focused (i.e. teacher is aware of language demands of the task); will continue to have some difficulty comprehending extended teacher talk at normal speed and with more complex ideas in learning activities when they are expressed through complex language.
Can comprehend gist of new topic-specific language if contextual and language support is given, and time is allowed for processing. Will miss some specific details because of lack of ‘depth’ of language, e.g. limited range of vocabulary, lack of understanding of complex structure and relationships such as degrees of certainty/uncertainty (i.e. modality) e.g. (might, could), problem/solution (if . . . then), before and after, compare/contrast (similar to; different from).
Lapses in comprehension of spoken texts can be caused by gaps in vocabulary, overload of new vocabulary, and gaps in concepts because of previous lapses in understanding. May lose the thread once a lapse occurs.
May lose concentration if topic and language of the lesson are unfamiliar.
Figure 8.10 Extract from The Australian (NLLIA) ESL Bandscales (Junior Primary Listening Levels 4 and 5).
For markers (that is, mostly ESL teachers in schools, often in conjunction with mainstream teachers), the ESL Bandscales can be said to avoid ambiguity in the sense that the clarity of purpose, the assessment of young learners in mainstream classes, helps observation of learning and assessment of learning outcomes. The validity and reliability of the ESL Bandscales, the assessment practices of ESL teachers around the ESL Bandscales, and the impact of the ESL Bandscales on teachers’ professional understandings are currently under review. Research to date in Australia, together with anecdotal feedback, strong sales of the document and adaptation of the materials for indigenous learners, indicates that many ESL teachers use and can relate to the ESL Bandscales (Breen et al., 1997; Davison and Williams, 2002).
In contrast, there are mixed signals about the value of the ESL Bandscales for users. Whilst ESL teachers (the markers) have found them to be appropriate and useful, they have needed to simplify the descriptions for the mainstream classroom teachers they work with. They have also needed to adapt and simplify them for students and parents. This is not necessarily a problem, since the adaptations can be carried out to suit the particular learning groups and learning contexts. At the system level, although some education systems have taken them up for administrative purposes, many education departments have chosen to use more administrative national or state-developed outcomes-based ESL frameworks rather than adopt the ESL Bandscales, because they are seen not to meet the needs of administrators, nor to follow the requirements of outcomes-based reform. They have, however, been taken up for administrative purposes by smaller, more flexible education departments. The impact of the NLLIA ESL Bandscales on ESL learners and their parents is believed to have been positive, firstly because of the increased understanding on the part of teachers about ESL learner progress, and secondly because students and parents can understand progress in positive terms against second language pathways, rather than only in negative comparison with first language literacy development.
Summary
Teachers and assessors need to determine the scoring method they will use; to do this, firstly they need to decide on the criteria by which children’s responses are evaluated, and secondly to decide on how to arrive at a score. Criteria tell teachers and assessors what they should be looking for in a learner’s performance and operationalize the construct that is to be assessed.
Good scoring rubrics reflect the construct(s) to be assessed and are created by teachers and assessors who are familiar with the curriculum, how the curriculum is taught and the characteristics of the young learners in question. They also give clear guidelines on how to arrive at a score. In language use assessment, it is difficult for criteria to be specific, and even when they are, they are interpreted differently by teachers because of their different experience and background, and orientation to the subject and the children. There are therefore a number of strategies that need to be followed to try to ensure reliability of assessment, including the careful development of rubrics, working with markers to reach consensus and training markers. It may be possible to use correct/incorrect or partial credit scoring, depending on the type of question. A number of principles are given in the chapter to guide the evaluation of scoring rubrics and reporting scales for young learners; scoring rubrics for young learners generally take the form of observation checklists, task-based criteria sheets and holistic and analytic rating scales. Rating scales may also be used as reporting scales. Each of these has strengths and weaknesses. Scoring rubrics are best constructed by groups of professionals working together and are best based on empirical evidence rather than simply on agreements about what ‘looks right’.
Many education departments around the world have introduced content and performance standards setting out what should be taught and assessed over time. Professional bodies have also produced performance standards setting out expected stages of growth. There are a number of ways that standards are constructed. Performance standards are a type of reporting scale and their effectiveness can be evaluated using considerations appropriate to reporting scales. Three examples of language standards from around the world show different ways that the construction of standards has been approached, how well this has been done and in particular how they meet the requirement that the characteristics of young learners’ learning are addressed.