Test Construction and Evaluation: A Brief Review
Article in Indian Journal of Applied Research · June 2015
Shafaat Hussain
Newcastle University
INDIAN JOURNAL OF APPLIED RESEARCH X 725
Volume : 5 | Issue : 6 | June 2015 | ISSN - 2249-555X | Research Paper
Test Construction and Evaluation: A Brief Review
Shafaat Hussain, Assistant Professor of Communication, Madawalabu University, Bale-Robe, Ethiopia
Sumaiya Sajid, Assistant Professor of English, Falahe Ummat Girls PG College, Bhadohi, Uttar Pradesh, India
English
Keywords: test construction, test evaluation, item analysis
ABSTRACT Testing has moved from an intuitive era, through a scientific one, to today's communicative era. The pursuit of professionalism is evidenced by a host of standards and codes of practice that testing organizations all over the world have developed, implemented and enforced. Creating a professionally sound assessment requires both art and science. Engaging in fair and meaningful assessment and producing relevant data about students’ achievement is an art; designing a test, formulating items, and processing grades is a science. This review article surveys test construction: its cyclic formulation process; the phases it involves (deciding content, specifying objectives, preparing a table of specifications and fixing items); and the way a test is evaluated (item difficulty and item discrimination). Teaching and testing are inseparable, and in order to judge students’ performance in a professionally sound way, it is important to know the norms, standards and ethics of test construction and evaluation.
1. INTRODUCTION Good testing practice has been discussed extensively in the language testing literature and has been approached from different perspectives by language testing researchers. One common approach is to discuss how a language test should be developed, administered, and evaluated (Alderson, Clapham and Wall 1995; Li 1997; Heaton 2000; Fulcher 2010). These discussions focus primarily on good practice at every step in the testing cycle, including, for instance, test specifications, item writing, test administration, marking, reporting test results, and post hoc analysis of test data. Another common approach is to focus on one particular dimension of language testing, to develop theoretical models of that dimension, and then to apply those models to language testing practice. For example, Bachman and Palmer (1996) developed a model of ‘test usefulness’, which, they argued, was ‘the most important consideration in designing and developing a language test’. Other examples of this approach are Cheng, Watanabe, and Curtis (2004) on test washback, Kunnan (2000, 2004) on test fairness, Shohamy (2001a, b) on use-oriented testing and the power of tests, and McNamara and Roever (2006) on the social dimensions of language testing. Good testing practice has also been documented at length in the standards and codes of practice developed by testing and research organizations all over the world. Examples include the ILTA Guidelines (2007), the ALTE Code of Practice (1994), the EALTA Guidelines (2006), and the ETS Standards for Quality and Fairness (2002) (Boyd and Davies 2002; Fulcher and Davidson 2007; Bachman and Palmer 2010).
Figure 1: Cyclic process of teaching and testing
2. TEST CONSTRUCTION Ideally, effective tests share certain characteristics. They are valid (providing useful information about the concepts they were designed to test), reliable (allowing consistent measurement and discriminating between different levels of performance), recognizable (instruction has prepared students for the assessment), realistic (concerning the time and effort required to complete the assignment), practical and objective. To achieve these, the teacher must draw up a test blueprint, or plan, that specifies the objectives, prepares a table of specifications, allocates the test length according to the time limit, and decides the types of items to be set (Wiggins 1998; Svinicki 1999).
Figure 2: Stages in Test construction
The test plan should include details of the test content in the specific course. Moreover, each content area should be weighted roughly in proportion to its judged importance. Usually, the weights are assigned according to the relative emphasis placed upon each topic in the textbook. The median number of pages on a given topic in the prescribed books is usually considered an index of its importance. In devising a classroom test, the advice and assistance of fellow teachers can prove to be of immense importance (Wiggins 1998; Riaz 2008).
2.2 Specifying the Objectives Each subject demands a different set of instructional objectives. For example, the major objectives of subjects like the sciences, social sciences, and mathematics are knowledge, understanding, application and skill. On the other hand, the major objectives of a language course are knowledge, comprehension and expression. The knowledge objective is considered the lowest level of learning, whereas understanding and application of knowledge are considered higher levels of learning. As the basic objectives of education are concerned with the modification of human behavior, the teacher must determine the measurable cognitive outcomes of instruction at the beginning of the course. The test determines the extent to which the objectives have been attained, both for individual students and for the class as a whole. Some objectives are stated as broad, general, long-range goals, e.g., the ability to exercise the mental functions of reasoning, imagination and critical appreciation. Such educational objectives are too general to be measured by classroom tests and need to be operationally defined by the class teacher (Wiggins 1998; Riaz 2008).
2.3 Preparing Table of Specifications A table of specifications is a two-way table that represents, along one axis, the content areas or topics that the teacher has taught during the specified period and, along the other axis, the cognitive level at which each is to be measured. In other words, the table of specifications shows how much emphasis is to be given to each objective and topic. While writing the test items, it may not be possible to adhere rigorously to the weights assigned to each cell. Thus, the weights indicated in the original table may need to be changed slightly during the course of test construction, if the teacher encounters sound reasons for such a change. For instance, the teacher may find it appropriate to modify the original test plan in view of data obtained from the experimental try-out of the new test (Wiggins 1998; Riaz 2008).
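The two-way layout described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the topic names, the percentage weights, the 40-item test length and the helper `build_table` are all hypothetical and do not come from the article; only the three cognitive levels follow the language-course objectives named earlier. The sketch multiplies the topic weight by the level weight to get each cell's share of items and uses largest-remainder rounding so that the cell counts add up to the planned test length.

```python
# Hypothetical weights, in whole percentages, for a language course.
CONTENT_PCT = {"Grammar": 40, "Vocabulary": 35, "Reading": 25}
LEVEL_PCT = {"Knowledge": 50, "Comprehension": 30, "Expression": 20}


def build_table(content_pct, level_pct, total_items):
    """Return {topic: {level: item_count}} whose counts sum to total_items."""
    if sum(content_pct.values()) != 100 or sum(level_pct.values()) != 100:
        raise ValueError("weights on each axis must sum to 100 percent")

    # Integer arithmetic: each cell's exact share is cw * lw * total / 10000.
    cells = []
    for topic, cw in content_pct.items():
        for level, lw in level_pct.items():
            whole, remainder = divmod(cw * lw * total_items, 100 * 100)
            cells.append((topic, level, whole, remainder))

    # Largest-remainder rounding: hand the leftover items to the cells
    # that lost the most when their share was rounded down.
    leftover = total_items - sum(whole for _, _, whole, _ in cells)
    ranked = sorted(cells, key=lambda cell: cell[3], reverse=True)
    bonus = {(topic, level) for topic, level, _, _ in ranked[:leftover]}

    table = {topic: {} for topic in content_pct}
    for topic, level, whole, _ in cells:
        table[topic][level] = whole + (1 if (topic, level) in bonus else 0)
    return table


table = build_table(CONTENT_PCT, LEVEL_PCT, total_items=40)
for topic, row in table.items():
    print(topic, row, "row total:", sum(row.values()))
```

The resulting counts are a starting plan rather than a fixed quota; as the section notes, the teacher may adjust individual cells when there is a sound reason to do so.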
2.4 Deciding Test Length The number of items that should constitute the final form of a test is determined by the purpose of the test or its proposed uses, and by the statistical characteristics of the items. Three considerations are important in setting test length: (i) the optimal number of items for a homogeneous test is lower than for a highly heterogeneous test; (ii) items meant to assess higher thought processes such as logical reasoning, creativity and abstract thinking require more time than items that depend on the ability to recall important information; and (iii) the length of the test and the time required for it affect its validity and reliability. The teacher has to determine the number of items that will yield maximum validity and reliability for the particular test (Wiggins 1998; Riaz 2008).
2.5 Fixing Types of Items Each type of exam item has its advantages and disadvantages in terms of ease of design, implementation, and scoring, and in its ability to measure different aspects of students’ knowledge or skills. Multiple-choice and essay items are often used in college-level assessment because they readily lend themselves to measuring higher order thinking skills (e.g., application, justification, inference, analysis and evaluation). Yet instructors often struggle to create, implement, and score these items (Worthen et al. 1993; Wiggins 1998; McMillan 2001). This section examines the guidelines to follow while designing the major types of items: true-false, gap-filling, matching, multiple-choice and essay.
2.5.1 Constructing True-False Items While constructing true-false items, trivial, broad, general and negative statements should be avoided. When a negative word is necessary and cannot be avoided, it should be underlined or put in italics so that students do not overlook it. Second, it is better not to include two ideas in one statement unless there is a cause-effect relationship between them. Third, opinion statements should not be used unless they are attributed to some source or the ability to identify opinion is being specifically measured. Fourth, true and false statements should be about equal in length. Fifth, there should be roughly equal numbers of true and false statements. Finally, statements should be simple in language and easy to understand (Gronlund and Linn 1990; Chase and Jacobs, 1992; Wiggins 1998; McMillan 2001).
2.5.2 Constructing Completion/Gap Filling Items While constructing completion/gap filling items, the item should be worded so that the required answer is both brief and specific. A direct question is generally more desirable than an incomplete statement. Statements should not be lifted directly from textbooks to serve as items. If the answer is to be expressed in numerical units, the type of answer wanted should be indicated. Blanks for answers (gap filling spaces) should be equal in length and placed in a column to the right of the question. Including too many blanks in one statement is not advisable (Gronlund and Linn 1990; Chase and Jacobs, 1992; Wiggins 1998; McMillan 2001).
2.5.3 Constructing Matching Items While constructing matching items, homogeneous material should be used in a single matching exercise. It is advisable to include an unequal number of responses and premises and to instruct the student that responses may be used once, more than once, or not at all. The list of premises in the left column should be kept brief, and the shorter responses should be placed on the right. The list of responses should be arranged in logical order: words in alphabetical order and numbers in sequence. The directions must indicate the basis for matching the responses and premises. Ambiguity should be avoided so that testing time is not wasted during the examination. Finally, all of one matching exercise should be placed on the same page (Gronlund and Linn 1990; Chase and Jacobs, 1992; Wiggins 1998; McMillan 2001).
2.5.4 Constructing Multiple-choice items The stem and the choices are the two parts of a multiple-choice item. The stem should be clearly worded and complete in itself. It should communicate the nature of the task to the students and present a clear problem or concept, and it should contain only the information needed to make that problem clear and specific. The options should be as short as possible. Avoid the use of negatives in the stem (use them only when you are measuring whether the respondent knows the exception to a rule or can detect errors). You can word most concepts in positive terms and thus avoid the possibility that students will overlook terms such as “no”, “not”, or “least” and choose an incorrect option not because they lack knowledge of the concept but because they have misread the question. Italicizing, capitalizing, using bold-face, or underlining the negative term makes it less likely to be overlooked.
Regarding the choices or options, make certain that the item has only one correct answer. Multiple-choice items usually have at least three incorrect options (distracters). Write the correct response with no irrelevant clues. A common mistake when designing multiple-choice questions is to write the correct option with more elaboration or detail, using more words, or using general terminology rather than technical terminology. Write the distracters to be plausible yet clearly wrong. An important, and sometimes difficult to achieve, goal is ensuring that the incorrect choices (distracters) appear to be possibly correct. Distracters are best created from common errors or misunderstandings about the concept being assessed, and they should be homogeneous in content and parallel in form and grammar. We should refrain from “all of the above,” “none of the above,” or other special distracters (use them only when an answer can be classified as unequivocally correct or incorrect). “None of the above” should be restricted to items of factual knowledge with absolute standards of correctness. It is inappropriate for questions where students are asked to select “the best” answer. “All of the above” is awkward in that many students will choose it if they can identify at least one of the other options as correct, and therefore assume all of the choices are correct, thereby obtaining a correct answer based on partial knowledge of the concept or content (Gronlund and Linn, 1990). We must use each alternative as the correct answer about the same number of times. Check to see whether option “a” is correct about the same number of times as option “b” or “c” or “d” across the instrument. It can be surprising to find that one has created an exam in which choice “a” is correct 90% of the time. Students quickly find such patterns and increase their chances of “correct guessing” by selecting that answer option by default (Gronlund and Linn 1990; Wiggins 1998; McMillan 2001).
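The check on answer-key balance described above is easy to automate. The following is a minimal sketch, not a procedure from the article: the function names, the sample answer keys and the 15-percentage-point tolerance are all assumptions chosen for illustration. It computes the share of items keyed to each option and flags any option that is the correct answer noticeably more often than an even split would give.

```python
from collections import Counter


def key_distribution(answer_key, options="abcd"):
    """Share of items for which each option is the keyed (correct) answer."""
    if not answer_key:
        raise ValueError("answer_key must contain at least one item")
    counts = Counter(answer_key)
    return {opt: counts.get(opt, 0) / len(answer_key) for opt in options}


def overused_options(answer_key, options="abcd", tolerance=0.15):
    """Options keyed more often than an even split plus `tolerance`.

    The tolerance is an arbitrary illustrative threshold, not a standard.
    """
    expected = 1 / len(options)
    shares = key_distribution(answer_key, options)
    return [opt for opt, share in shares.items() if share - expected > tolerance]


# Hypothetical ten-item keys: the first leans heavily on option "a".
skewed_key = list("aabacaadaa")
balanced_key = list("abcdabcdab")

print(key_distribution(skewed_key))
print(overused_options(skewed_key))    # ['a']
print(overused_options(balanced_key))  # []
```

A flagged option is only a prompt to reorder the choices in some items; the content of the options does not need to change.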
2.5.5 Constructing Essay Items Essays can tap complex thinking by requiring students to organize and integrate information, interpret information, construct arguments, give explanations, evaluate the merit of ideas, and carry out other types of reasoning. In practice, we must restrict the use of essay questions to educational outcomes that are difficult to measure using other formats. Other assessment formats are better for measuring recall knowledge, but the essay is able to measure deep understanding and mastery of complex information. Construct the item to elicit the skills and knowledge in the educational outcomes, and write it so that students clearly understand the specific task. Once you have identified the specific skills and knowledge, you should word the question clearly and concisely so that it communicates to the students the specific task(s) you expect them to complete (e.g., state, formulate, evaluate, use the principle of, create a plan for, etc.). If the language is ambiguous or students feel they are guessing at “what the instructor wants me to do,” the ability of the item to measure the intended skill or knowledge decreases. Indicate the amount of time and effort students should spend on each essay item. With essay items, especially when several are used or when they are combined with other item formats, you should provide students with a general time limit or time estimate to help them structure their responses. Providing estimates of the length of the written response to each item can also help students manage their time, since it gives cues about the depth and breadth of information required to complete the item. In restricted-response items a few paragraphs are usually sufficient to complete a task focusing on a single educational outcome.
We should avoid giving students options as to which essay questions they will answer. A common structure in many exams is to provide students with a choice of essay items (e.g., “choose two out of the three essay questions to complete…”). Instructors, and many students, often view essay choice as a way to increase the flexibility and fairness of the exam by allowing learners to focus on those items for which they feel most prepared. However, the choice actually decreases the validity and reliability of the instrument because each student is essentially taking a different test. Creating parallel essay items (from which students choose a subset) that test the same educational objectives (skills, knowledge) is very difficult, and unless students are answering the same questions that measure the same outcomes, the scoring of the essay items and the inferences made about student ability are less valid. While allowing students a choice gives them the perception that they have the opportunity to do their best work, you must also recognize that choice makes it difficult to draw consistent and valid conclusions about student answers and performance. Consider using several narrowly focused items rather than one broad item. For many educational objectives aimed at higher order reasoning skills, creating a series of essay items that elicit different aspects of students’ skills and knowledge can be more efficient than attempting to create one question to capture multiple objectives. By using multiple essay items (which all students complete), you can capture a variety of skills and knowledge while also covering a greater breadth of course content (Cashin 1987; Gronlund and Linn 1990; Worthen et al. 1993; Wiggins 1998; McMillan 2001).
3. WHOLISTIC PERSPECTIVE Different types of questions can be devised for an achievement test, for instance, multiple-choice, fill-in-the-blank, true-false, matching, short answer and essay. Each type of question is constructed according to different principles. Instructions for each type of question must be simple and brief. Questions ought to be written in simple language. If the language is difficult or ambiguous, even a student with strong language skills and a good vocabulary may answer incorrectly if his or her interpretation of the question differs from the author’s intended meaning (Worthen et al. 1993; Thorndike 1997; Wiggins 1998). Test items must assess a specific ability or the comprehension of content developed during the course of study (Gronlund and Linn 1990). Write the questions as you teach so that your teaching may be aimed at significant learning outcomes. A tester has to devise questions that call for comprehension and application of knowledge and skills. Some of the questions must aim at appraising examinees’ ability to analyze, synthesize, and evaluate novel instances of the concepts. If the instances are the same as those used in instruction, students are only being asked to recall (knowledge level). Questions should be written in different formats, e.g., multiple-choice, completion, true-false, short answer etc., to maintain the interest and motivation of the students. The teacher should prepare alternate forms of the test to deter cheating and to provide for make-up testing (if needed). The items should be phrased so that the content rather than the format of the statements determines the answer. Sometimes an item contains “specific determiners” which provide an irrelevant cue to the correct answer. For example, statements that contain terms like always, never, entirely, absolutely, and exclusively are much more likely to be false than true. On the other hand, statements that contain terms such as may, sometimes, as a rule, and in general are much more likely to be true.
Besides, care should be taken to avoid double negatives, complicated sentence structures, and unusual words. The difficulty level of the items should be appropriate for the ability level of the group. Optimal difficulty, expressed as the percentage of students answering correctly, is about 75 percent for true-false items, about 60 percent for five-option multiple-choice questions, and approximately 50 percent for completion items. However, difficulty is not an end in itself. The item content should be determined by the importance of the subject matter. It is desirable to place a few easy items at the beginning to motivate students, particularly those of below average ability (Wiggins 1998; Halpern and Hakel 2003). The items should be devised in such a manner that different taxonomy levels are evaluated. Besides, items pertaining to a specific topic or of a particular type should be placed together in the test. Such a grouping facilitates scoring and evaluation. It also helps the examinees to think through and answer items that are similar in content and format without shifts of attention or changes of mind set. Directions to the examinees should be as simple, clear, and precise as possible, so that even students of below average ability can clearly understand what they are expected to do. Scoring procedures must be clearly defined before the test is administered. The test constructor must clearly state the optimal testing conditions for test administration. Item analysis should be carried out so that necessary changes can be made if any ambiguity is found in the items (Gronlund and Linn 1990; Wiggins 1998; McMillan 2001).
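The three optimal-difficulty figures quoted above (75, 60 and 50 percent) are consistent with a common rule of thumb that places the target halfway between the chance score and a perfect score. That rule is an assumption introduced here for illustration; the article gives only the figures, not a formula. The sketch below, with the hypothetical helper `optimal_difficulty`, reproduces the three values under that assumption, treating a completion item as having no chance of a correct guess.

```python
from fractions import Fraction


def optimal_difficulty(n_options=None):
    """Midpoint between the chance score and a perfect score.

    n_options: number of choices per item (2 for true-false, 5 for
    five-option multiple choice). None means a supply-type item such as
    completion, where the chance of guessing correctly is taken as zero.
    The midpoint rule itself is an assumption, not the article's formula.
    """
    chance = Fraction(0) if n_options is None else Fraction(1, n_options)
    return (chance + 1) / 2


for label, k in [("true-false", 2), ("five-option multiple choice", 5), ("completion", None)]:
    target = optimal_difficulty(k)
    print(f"{label}: about {float(target):.0%} of students answering correctly")
```

Under the same assumption, a four-option multiple-choice item would have a target of 5/8, or 62.5 percent, which is why the target depends on the number of distracters as well as the item type.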
4. TEST EVALUATION A good test has good items. Good test making requires careful attention to the principles of item evaluation. After taking an exam, students often judge whether the test was fair and good. The teacher is also usually interested in how the test worked for the students.
4.1 Item Analysis Item analysis is about how difficult an item is and how well it can discriminate between the good and the poor students. In other words, item analysis provides a numerical assessment of item difficulty and item discrimination. It provides objective, external and empirical evidence of the quality of the items. The objective of item analysis is to identify problematic or poor items: items that confuse the respondents, that do not have a clearly correct response, or in which a distracter competes with the keyed answer. Item analysis comprises item difficulty and item discrimination (Wiggins 1998; Riaz 2008).
4.1.1 Item Difficulty Item difficulty is determined from the proportion (p) of students who answered each item correctly. Expressed as a percentage, item difficulty can range from zero (no one could solve the item) to one hundred (everyone solved it correctly). The goal is usually to have items of all difficulty levels in the test so that the test can identify poor, average and good students. However, most of the items are designed to be of average difficulty, for such items are more useful. Item analysis provides the difficulty level of each item. Optimally difficult items are those that 50% to 75% of students answer correctly. Items are considered low to moderately difficult if (p) is between 70% and 85%. Items that only 30% of students or fewer solve correctly are considered difficult. The Item Difficulty Percentage can also be expressed as an Item Difficulty Index in decimals, e.g., .40 for an item that 40% of the test-takers could solve. Thus the index can range from 0 to 1. Items should fall at a variety of difficulty levels in order to differentiate between good and average as well as between average and poor students. Easy items are usually placed in the initial part of the test to motivate students to take the test and to alleviate test anxiety. The optimal item difficulty also depends on the question type and the number of possible distracters (Wiggins 1998; Riaz 2008).
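The item difficulty index described above is a simple proportion, and a short sketch may make the calculation concrete. The score matrix and the function names below are hypothetical; the 30% cut-off for a difficult item is the one given in the text. Each row holds one student's item scores, with 1 for a correct answer and 0 for an incorrect one.

```python
def item_difficulty(responses):
    """Proportion of students answering each item correctly (index from 0 to 1).

    responses: one list of 0/1 item scores per student, all of equal length.
    """
    if not responses:
        raise ValueError("responses must contain at least one student")
    n_students = len(responses)
    n_items = len(responses[0])
    return [sum(student[i] for student in responses) / n_students
            for i in range(n_items)]


def difficult_items(p_values, cutoff=0.30):
    """Item positions (0-based) that the cutoff or fewer students solved."""
    return [i for i, p in enumerate(p_values) if p <= cutoff]


# Hypothetical scores for five students on three items.
scores = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]

p_values = item_difficulty(scores)
print(p_values)                   # [0.8, 0.6, 0.2]
print(difficult_items(p_values))  # [2]
```

Note that a higher index means an easier item: the first item, solved by 80% of the students, is the easiest of the three, and the third, solved by 20%, is flagged as difficult.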
4.1.2 Item Discrimination Another way to evaluate items is to ask who gets an item correct: the good, the average or the weak students? Assessment of item discrimination answers this query. Item discrimination refers to the percentage difference in correct responses between the high scoring and the poor students. In a small class of 30 students, one can administer the test items, score them and then rank the students in terms of their overall score. Next, we separate the top 15 students and the bottom 15 into two groups: the upper and the lower groups. Finally, we work out the percentage of students in each of the two groups who answered each item correctly (p). The discrimination (D) power of the item is then found by taking the difference between the percentage for the upper group and the percentage for the lower group. The higher the difference, the greater the discrimination power of the item. An item with a discrimination of 60% or greater is considered a very good item, whereas a discrimination of less than 20% indicates low discrimination and the item needs to be revised. An item with a negative index of discrimination indicates that the poor students answer it correctly more often than the good students do. Such items, including difficult items that show negative discrimination, should be dropped from the test. A discrimination of 100% would occur if all those in the upper group answered correctly and all those in the lower group answered incorrectly. Zero discrimination occurs when equal numbers in both groups answer correctly. Negative discrimination, a highly undesirable condition, occurs when more students in the lower group than in the upper group answer correctly. Items with a discrimination of 25% and above are considered good (Wiggins 1998; Riaz 2008).
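The upper-group and lower-group procedure described above can be sketched as follows. The data and the function name are hypothetical, and two choices are assumptions made for the sketch: the index is returned as a proportion between -1 and 1 rather than as a percentage, and when the class size is odd the middle student is left out of both groups.

```python
def item_discrimination(responses):
    """Discrimination index D = p_upper - p_lower for each item.

    responses: one list of 0/1 item scores per student. Students are
    ranked by total score and split into upper and lower halves.
    """
    half = len(responses) // 2
    if half == 0:
        raise ValueError("at least two students are needed")
    ranked = sorted(responses, key=sum, reverse=True)
    upper, lower = ranked[:half], ranked[-half:]
    n_items = len(responses[0])
    indices = []
    for i in range(n_items):
        p_upper = sum(student[i] for student in upper) / half
        p_lower = sum(student[i] for student in lower) / half
        indices.append(p_upper - p_lower)
    return indices


# Hypothetical scores for four students on five items (unsorted on purpose).
scores = [
    [0, 1, 0, 0, 1],  # total 2
    [1, 1, 1, 1, 0],  # total 4
    [0, 0, 0, 0, 1],  # total 1
    [1, 1, 1, 0, 0],  # total 3
]

print(item_discrimination(scores))  # [1.0, 0.5, 1.0, 0.5, -1.0]
```

In this toy data the last item has a negative index because only the two lowest-scoring students answered it correctly, which is the condition the text says calls for dropping the item.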
5. CONCLUSION Tests have undergone radical changes in the past sixty years due to improvements in measurement techniques and a better understanding of learning processes. What once required a lengthy three-hour essay-type examination can now be assessed more comprehensively in a thirty-minute objective-type paper, which can assess not only knowledge but also comprehension and application of knowledge. Additionally, a well prepared paper can evaluate students objectively and quickly, so that a large number of students in a class is not a problem. Tests are the goal posts which act as guides and motivators for students to learn. We all know from our own experience how students prepare for examinations. They learn not only what interests them most or is presented well, but also what suits the type of paper they expect from the teacher. For this reason, a well prepared examination paper is a guarantee of an effective teaching-learning process.
REFERENCES
[1] Alderson, JC, Clapham, C, & Wall, D. 1995. Language test construction and evaluation. Cambridge: Cambridge University Press.
[2] Bachman, LF, & Palmer, AS. 1996. Language testing in practice. Oxford: Oxford University Press.
[3] Bachman, LF, & Palmer, AS. 2010. Language assessment in practice: developing language assessments and justifying their use in real world. Oxford: Oxford University Press.
[4] Fulcher, G. 2010. Practical language testing. London: Hodder Education.
[5] Fulcher, G & Davidson, F. 2007. Language testing and assessment: an advanced resource book. Oxon: Routledge.
[6] Heaton, J. 2000. Writing English language tests. Beijing: Foreign Language Teaching and Research Press.
[7] Boyd, K & Davies, A. 2002. Doctor’s orders for language testers: the origin and purpose of ethical codes. Language Testing, 19(3), 296–322.
[8] Brown, FG. 1983. Principles of educational and psychological testing. 3rd edition. New York: Holt, Rinehart and Winston.
[9] Cashin, WE. 1987. Improving essay tests. Manhattan: Center for Faculty Evaluation and Development.
[10] Cheng, L, Watanabe, Y & Curtis, A. 2004. Washback in language testing: Research methods and contexts. London: Lawrence Erlbaum.
[11] Gronlund, NE & Linn, RL. 1990. Measurement and evaluation in teaching. 6th edition. New York: Macmillan.
[12] Halpern, DH & Hakel, MD. 2003. Applying the science of learning to the university and beyond. Change, 35(4), 37–41.
[13] Isaac, S & Michael, WB. 1990. Handbook in research and evaluation. San Diego: CA.
[14] Kunnan, AJ. 2000. Fairness and validation in language assessment. Cambridge: Cambridge University Press.
[15] Milanovic, M & Weir, C. 2004. European language testing in a global context: proceedings of the ALTE Barcelona conference. Cambridge: Cambridge University Press.
[16] Li, X. 1997. The science and art of language testing. Changsha: Hunan Education Press.
[17] McMillan, JH. 2001. Classroom assessment: Principles and practice for effective instruction. Boston: Allyn and Bacon.
[18] McNamara, T & Roever, C. 2006. The social dimensions of language testing. Oxford: Blackwell Publishing.
[19] Riaz, MN. 2008. Test Construction: Development and Standardization of Psychological Tests in Pakistan. Islamabad: HEC.
[20] Shohamy, E. 2001a. The power of tests: a critical perspective on the uses of language tests. Essex: Pearson Education.
[21] Shohamy, E. 2001b. Democratic assessment as an alternative. Language Testing, 18(4), 373–391.
[22] Svinicki, MD. 1999. Evaluating and grading students. Austin: University of Texas.
[23] Thorndike, RM. 1997. Measurement and evaluation in psychology and education. New Jersey: Prentice-Hall.
[24] Wiggins, GP. 1998. Educative assessment: Designing assessments to inform and improve student performance. San Francisco: Jossey-Bass.
[25] Worthen, BR, Borg, WR & White, KR. 1993. Measurement and evaluation in the schools. New York: Longman.