Summary of Chapter 1 in PowerPoint
Practical Issues Related to Testing
FACTORS MAKING FOR PRACTICALITY IN ROUTINE TEST USE
Although validity and reliability may be all-important for measures in psychology and education, when a test is to be used in classrooms throughout a school or school system or in any large or ongoing testing application in government or industry, a number of down-to-earth, practical considerations must also be taken into account. It is easy for the administrator to pay too much attention to small financial savings or to economies of time that make it possible to fit a test into a standard class period with no shifting of schedules; nevertheless, these factors of economy and convenience are real considerations. Furthermore, other factors relating to the readiness with which the tests may be given, scored, and interpreted bear importantly on the use that will be made of the tests and the soundness of the conclusions that will be drawn from them.
Economy
The practical significance of dollar savings need not be emphasized in the current environment of tight educational and corporate budgets. Dollars are of real significance for any educational or industrial enterprise, and research users must also watch expenses. Economy in the case of tests depends in part on the cost of test materials and scoring services per examinee and in part on the possibility of reusing the test materials. For example, in the upper elementary grades and beyond, tests can be administered that have separate test booklets and answer sheets, so that the test booklets can be used for a number of testings. Also, if a test is used in successive years or if testing can be scheduled so that different classes or schools can be tested on different days, an important economy can be effected. Publishers sell test booklets and answer sheets separately for just this reason.
A second aspect of economy is time savings in test administration; however, this economy is often a false one. We saw in Chapter 4 that the reliability of a test depends largely on its length. As far as testing time is concerned, we get about what we give. Some tests may be designed a little more efficiently so that they give a little more reliable measurement per minute of testing time, but, by and large, any reduction in testing time will be accomplished at the price of loss in precision or breadth of our appraisal, unless it is possible to use a computer adaptive test. This option, however, requires that a sufficient number of appropriate computers be available and that the examinees be capable of responding in this format.
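The trade-off between testing time and reliability can be made concrete with the Spearman-Brown formula, the usual way of projecting reliability when a test is lengthened or shortened with comparable items. The sketch below is an illustration only: the function name and the starting reliability of .80 are assumptions chosen for the example, not values from this chapter.

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Project reliability when test length is multiplied by length_factor.

    Assumes the added (or removed) items are comparable to the existing ones.
    """
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    if length_factor <= 0:
        raise ValueError("length_factor must be positive")
    return length_factor * reliability / (1.0 + (length_factor - 1.0) * reliability)


# Hypothetical test with reliability .80 at its full length.
full_length_reliability = 0.80
for factor in (0.5, 1.0, 2.0):
    projected = spearman_brown(full_length_reliability, factor)
    print(f"length x{factor}: projected reliability = {projected:.3f}")
```

Under these assumptions, cutting the hypothetical test to half its length lowers the projected reliability from .80 to about .67, whereas doubling it raises the figure only to about .89, which is one way of seeing why time savings are often a false economy.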
A third, and quite significant, aspect of economy is the ease of scoring. The clerical work of scoring a battery of tests by hand can become either burdensome if it is done by the already busy teacher or psychologist or expensive if it is carried out by clerical help hired especially for the purpose. As a result, test users rely heavily on mechanized scoring, and test publishers are producing tests that can be processed by the increasingly sophisticated equipment being developed. A number of commercially published tests, such as the SAT and the Strong Interest Inventory, can be scored only by the publisher or at special centers licensed by the publisher. Other tests, such as the individually administered Stanford-Binet and Wechsler scales, are scored only by hand. But, for many published tests, test users have the option of hand scoring their own tests, of setting up their own mechanized scoring units, or of sending tests to one of the test scoring services that specialize in test processing.
Computer Scoring
Several test-scoring services, many run by the test publishers themselves, provide efficient test scoring and reporting, often giving 24-hour turnaround; that is, they will score the tests and put the results in return mail within 24 hours of their receipt of the completed answer sheets. Also, test scoring equipment is becoming so inexpensive and readily available that many schools, school districts, or personnel offices maintain their own scoring services. The basic equipment consists of a photoelectric document reader combined with a computer. The document reader responds to marks on an answer sheet or in the test booklet. (At least three companies sell the scanning part of a test scoring system for under $5,000, a cost that is within the means of many school districts and modest-sized companies.)
Using a separate answer sheet is familiar to most American college students, but separate answer sheets are not satisfactory for young children. The complications of finding an answer in the test booklet, keeping the code letter or number of the chosen answer in mind while locating the proper place on the answer sheet, and then marking the proper spot on the answer sheet are too much for children in the primary grades. However, current equipment makes it possible to use a booklet at the primary level. Students mark their answers directly in the test booklet. For scoring, the test can be designed so that either one slices the bound edge off the test booklet and runs the separate pages through the scanner, or the test is printed in a fanfold booklet that can be unfolded and run through the scanner as a unit. Modern scanning equipment can handle 50 or more different answer sheet patterns in a single run.
The information from the optical scanner is fed into the computer where it is compared with a key that has been recorded in the computer’s memory. One or more scores are determined and stored in a file for later analysis or printing out on a record form such as those described in Chapter 3. The computer can also produce various statistics, such as means and standard deviations, as well as local percentiles, for classes, schools, or school systems. Thus, the scoring service, whether it is a commercial agency or a district or local office, can generally provide not only test scores for individuals, but also the complete range of statistical information about the test results that might be of interest to the local teacher or school system. Some of the scoring services will also accumulate results over a period of years, providing a longitudinal picture of the local test results. One advantage to the local school of having its own scoring and computing equipment is that it is then much easier to carry out item analyses (see Chapter 9) on locally constructed tests, as well as to accumulate local norms and longitudinal results.
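The scoring step described above (comparing scanned responses with a stored key, then summarizing the scores) can be sketched in a few lines. Everything in this example is hypothetical: the five-item key, the student identifiers, and the responses are invented for illustration, and the percentile rank uses the common convention of counting half of the tied scores as falling below.

```python
from statistics import mean, pstdev

# Hypothetical five-item answer key held in memory.
ANSWER_KEY = ["B", "D", "A", "C", "B"]


def score_sheet(responses, key):
    """Count the responses that match the key; a blank (None) never matches."""
    return sum(1 for marked, correct in zip(responses, key) if marked == correct)


def percentile_rank(score, all_scores):
    """Percent of scores below the given score, plus half of those tied with it."""
    below = sum(1 for s in all_scores if s < score)
    tied = sum(1 for s in all_scores if s == score)
    return 100.0 * (below + 0.5 * tied) / len(all_scores)


# Hypothetical scanned answer sheets for one small class.
sheets = {
    "student_01": ["B", "D", "A", "C", "B"],
    "student_02": ["B", "D", "C", "C", "A"],
    "student_03": ["A", "D", "A", None, "B"],
    "student_04": ["B", "A", "A", "C", "B"],
}

scores = {name: score_sheet(marks, ANSWER_KEY) for name, marks in sheets.items()}
class_scores = list(scores.values())

print("class mean:", mean(class_scores))
print("class standard deviation:", round(pstdev(class_scores), 3))
for name, score in scores.items():
    print(name, score, "local percentile rank:", percentile_rank(score, class_scores))
```

The same loop could be run over classes, schools, or a whole district to produce the local percentiles mentioned above; a real scoring service would also have to handle multiple marks, erasures, and several keys per run.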
Perhaps one of the least expected developments in test processing software has been the development of programs that score essay tests (Page, 1994) using scoring rubrics and model answers. A rubric is a set of directions for how to compare an answer with the model answer and how to evaluate elements such as grammar and spelling. Page and his associates report that their program produces results that are as reliable as the average of up to six human judges. Of course, the quality of the output depends critically on the adequacy of the scoring rubric, but as more experience is accumulated with computerized scoring of produce-response tasks, we may see a move toward increased use of items of this type in standardized tests.
A test deemed economical for use in large-scale testing programs will be available in a format for scoring by different types of scanning and scoring equipment. Almost any type of answer sheet can be hand scored relatively efficiently by using an overlay stencil. Some test publishers supply plastic overlay stencils for this purpose. Others use a special type of answer sheet made up of two sheets fastened together, the back of one being carbon covered or pressure sensitive. The key is printed on the back of the first sheet, and as the examinee marks the front, the carbon-covered or pressure-sensitive material transfers the marks to the back of the sheet, where the number of marks falling in the printed key spaces can be counted.
Potential test purchasers will want to determine what types of answer forms and what scoring services are available for any test under consideration. If scoring is to be done locally, they should know of any restrictions imposed by the available scanner. Test publishers who allow local scoring of their tests will generally have answer sheets that can be scored by the common types of scanners. Several of the popular tests also have software packages available that will produce score profiles and the various types of reports described in Chapter 3.
Computerized Test Interpretation
Test authors and publishers have increasingly relied on computers to generate narrative reports of test results that interpret the numerical scores for the user. A catalog of statements is stored in computer memory, and the computer is programmed so that a score falling in a given range, a particular combination of two or more scores, or even a combination of responses to specific questions triggers the production of one or more of these statements. The elements in the narrative may be as simple and as directly related to a particular score as “Henry is somewhat below the average in arithmetic problem solving” or as extended as the interpretive personality picture provided in Chapter 14. The report may focus solely and very specifically on one examinee’s scores, or it may include an extended general discussion of the conceptual basis for the instrument. For example, CPP produces a report for the Strong Interest Inventory that provides, as a basis for interpreting the examinee’s scores, a full exposition of the general nature of the instrument and of the rationale for the different types of scores that it produces, together with suggested sources for further information about occupations that appear congruent with the examinee’s pattern of interests and a nine-page “personalized” in-depth interpretation. As we saw in Chapter 3, scoring services can also produce computer-generated summary reports for classes and districts.
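The simplest form of this mechanism, a catalog of statements keyed to score ranges, might look something like the following sketch. The percentile cut points, the wording of the statements other than the "Henry" sentence quoted above, and all names in the code are assumptions made for illustration; they do not describe any publisher's actual reporting software.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    low: int       # lowest percentile rank covered (inclusive)
    high: int      # highest percentile rank covered (inclusive)
    template: str  # statement produced when a score falls in the range


# Hypothetical catalog; the cut points are illustrative assumptions.
STATEMENT_CATALOG = [
    Rule(1, 15, "{name} is well below the average in {area}."),
    Rule(16, 39, "{name} is somewhat below the average in {area}."),
    Rule(40, 60, "{name} is about average in {area}."),
    Rule(61, 84, "{name} is somewhat above the average in {area}."),
    Rule(85, 99, "{name} is well above the average in {area}."),
]


def narrative(name, percentile_ranks):
    """Build a report by selecting one catalog statement per score."""
    sentences = []
    for area, rank in percentile_ranks.items():
        for rule in STATEMENT_CATALOG:
            if rule.low <= rank <= rule.high:
                sentences.append(rule.template.format(name=name, area=area))
                break
    return " ".join(sentences)


report = narrative(
    "Henry",
    {"arithmetic problem solving": 30, "reading comprehension": 72},
)
print(report)
```

Rules triggered by combinations of scores, or by answers to specific questions, would follow the same pattern with a more elaborate condition in place of the simple range check.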
Narrative computer printouts can be no better than the wisdom and clinical experience on which they are based. However, they do permit the distillation of that experience as it has accumulated in the professional literature and in the pooled background of a number of experts. Computer-generated printouts protect the examinee from possible bias or inexperience of a local practitioner, while also expediting what can be a time-consuming and tedious chore of report writing. Their use guarantees that aspects of the record dealt with by the narrative will not be inadvertently overlooked by a current interpreter and also ensures uniformity and consistency in interpretation. The narrative does need to be evaluated by a counselor or clinician who knows other facts about the client, but it can serve as a useful foundation for such an evaluation.
Features Facilitating Test Administration
In evaluating the practical usability of a test, one factor to be taken into account is the ease of administration. A test that can be handled adequately by the regular classroom teacher with no more than a session or so of special briefing fits much more readily into a testing program than a test requiring specially trained administrators and large blocks of time for testing. Several factors contribute to the ease of giving and taking a test:
1. A test is easy to give if it has clear, full instructions. The instructions for test administration should be written out completely, so that all the examiner must do is read and follow them. Everything the examiner should say to the examinees should be included, verbatim, in the instructions, preferably in contrasting color or boldface type. Instructions for the examinee should also be complete and should provide appropriate practice exercises. The amount of practice that should be provided depends on how novel the test task is likely to be for those being tested. When it is a familiar type of task or a simple and straightforward instruction, no more than a single example may be needed; however, for an unusual item format or a complex test task, more practice will be desirable. The better test publishers make practice tests available so students can gain experience with the types of items included on the test.
Familiarity with the types of tasks the test requires is also a possible source of irrelevant differences between examinees. Many underprivileged children are not as familiar with testing situations as are children from more advantaged homes. A plentiful supply of practice items can help reduce the negative effect of task unfamiliarity and yield a more accurate picture of the abilities of disadvantaged children. With the prevalence of tests in society, instruction in how to take tests (through administering a number of practice tests) may even be a reasonable educational activity in the earlier grades because it would reduce “testwiseness” as one irrelevant source of individual differences in scores.
2. A test is easy to give if there are few separately timed units and if close timing is not crucial. Timing a number of brief subtests to a fraction of a minute is a bothersome undertaking, and the timing is likely to be inaccurate unless a stopwatch is available for each tester. Some tests have as many as 8 or 10 parts, each taking only 2 or 3 minutes. A test made up of 3 or 4 parts, with time limits of 5, 10, or more minutes for each, will be easier to use. Errors in timing, which can produce inaccurate test scores, are also less likely with fewer and longer parts. However, one advantage of computer-administered tests is that timing problems can be eliminated.
3. The layout of the test items on the page has a good deal to do with the ease of taking the test. Items in which the response options all run together on the same line or those with small or illegible pictures or diagrams or which are crowded together or run over from one page to the next all create difficulty for the examinee. Item difficulty should come from the content and the cognitive processes required to determine the right answer, not from the test format. Print and pictures should be large and clear. Response options should be well separated from one another. All parts of an item and all items referring to a single figure, problem, or reading passage should appear on the same page or a double-page spread. Also, to the extent that the physical layout of the test is difficult for examinees to follow, scores may be lower than they should be, a fact that would adversely affect the validity of the test and be particularly detrimental for a criterion-referenced test.
Features Facilitating Interpretation and Use of Scores
Although the point is sometimes overlooked, it seems axiomatic that a test is given so that the results can be used. If the score is to be used, it must be interpreted and given meaning. The author and publisher of the test have the responsibility of providing users with information that permits them to make sound appraisals of the test in relation to their needs and to give appropriate meaning to the score of each individual. The authors and publishers do this primarily through the test manual and other collateral materials prepared to accompany the test. What may the test user reasonably expect to find in the manual for a test, together with its supporting materials? We have outlined below the aids that the user should expect.
1. A statement of the functions the test was designed to measure and of the general procedures by which it was developed. In this statement, the author tells what he or she considers to be the appropriate uses of test scores and provides the evidence that proper steps have been taken to ensure that recommended interpretations are supported. Particularly for achievement tests, in which the primary concern is that the test measures specific content areas and cognitive processes, the manual should describe the procedures by which the choice of content was made or how the analysis of the functions being measured was carried out. If the author is unwilling to expose his or her thinking to our critical scrutiny, we may perhaps be skeptical of the thoroughness or profundity of that thinking.
Procedures to be reported include not only the rational procedures by which the range of content or types of objectives were selected but also the empirical procedures by which the items were reviewed, tried out, and screened for final inclusion in the test. This description should include the composition of review panels and the statistical procedures employed to analyze test results for possible bias. The validity of the test—the interpretations it supports, caveats regarding these interpretations, and the uses to which the scores may properly be put—is its most important property. The author’s plan and procedures for construction are vital steps in achieving validity and should be explicitly stated so that potential users can judge the quality of these activities.
2. Detailed instructions for administering the test. Earlier in this section we discussed the need for this aid to uniform and easy administration by teachers or others who will have to use the test. Of course, proper administration is also vital to achieving scores that are valid for the user’s purpose. To the extent that directions for administration are violated, the validity of the test score may be compromised.
3. Scoring keys and specific instructions for scoring the test. The problems of scoring have also been discussed. If the test can or must be scored locally, the manual and supporting materials should provide detailed instructions on how each score is to be computed, how errors are to be treated, and how part scores are to be combined into a total score. Scoring keys and stencils should be planned to facilitate, as much as possible, the onerous task of hand scoring when scoring is to be done by the local user, and instructions should also be given for electronic scoring.
4. Norms for appropriate reference groups. These norms, together with information on how the norms were obtained and with instructions for their use, should be included in the test manual or a separate publication devoted to normative information. A full consideration of types of test norms and their uses was presented in Chapter 3. It is therefore sufficient at this time to point out the responsibility of the test producer to develop suitable norms for the groups with which the test is to be used. General norms are a necessity, and norms suitable for special types of communities, special occupational groups, and other more limited subgroups will add to the usefulness of the test in many cases.
5. Evidence of the reliability of the test. This evidence should indicate not only the simple reliability statistics but also the operations used to obtain the reliability estimates and the descriptive and statistical characteristics of each group on which reliability data are based. If a test is available in more than one form, it is highly desirable for the producers to report the correlation between the two forms, in addition to any data that were derived from a single testing. If the test yields part scores, and especially if it is proposed that any use be made of the pattern of these part scores, reliability data should be reported for the separate part scores. It is good procedure for the author to report standard errors of measurement, as well as reliability coefficients. An author who indicates what the standard error of measurement is at each of a number of score levels should particularly be commended because this information shows the range of scores over which the test retains its accuracy.
6. Evidence on the intercorrelations of subscores. If the test provides several subscores, the manual should provide information on the intercorrelations of the subscores. This information is important in guiding the interpretation of the subscores and particularly in judging how much confidence to place in the differences between the subscores. If the correlations among the subscores approach their reliabilities, indicating that they measure much the same things, the differences between them will be largely meaningless and uninterpretable. Information on the reliability of the subscores, coupled with knowledge of their correlations, permits an accurate evaluation of the degree to which difference scores add information to the original scores. Some test publishers also provide information on the frequency with which score differences of a particular magnitude occurred in the standardization sample. This is excellent practice when patterns of differences between scores are claimed to have particular interpretive meaning. Factor analyses of the correlations among subscores should be reported to confirm that any recommended combinations of subscores can reasonably be considered to measure a common dimension. Such analyses support claims of content and construct validity.
7. Evidence on the relationship of the test to other variables. Insofar as the test is to be used as a predictive device, correlations with criterion measures constitute the essential evidence on how well it does in fact predict. Full information should be provided on the nature of the criterion variables, the groups for which data are available, and the conditions under which the data were obtained. Only then can the potential test user fairly judge the validity of the test as a predictor.
It will often be desirable to report correlations with other measures of the same function as collateral evidence bearing on the construct validity of the test. For example, correlations with an individual intelligence test are relevant in the case of a group measure of intelligence.
Finally, indications of the relationship of test scores to age, gender, type of community, socioeconomic level, and similar facts about the individual or group are often helpful. They provide a basis for judging how sensitive the measure is to the background of the group members and to the circumstances of their lives and their education. Evidence of this kind is also useful for judging whether the norm groups used by the publisher are appropriate for the local population.
8. Guides for using the test and for interpreting results obtained with it. The developers of a test presumably know how it is reasonable for the test to be used and how the results from it should be evaluated. They are specialists in that test. For the test to be most useful to others, especially the teacher with limited specialized training, suggestions should be given of ways in which the test results may be used for diagnosing individual and group strengths and weaknesses, forming class groupings, organizing remedial instruction, counseling the individual, or whatever other activities may appropriately be based on results from that particular type of instrument. Computerized test interpretations can be particularly valuable here.
A final point to keep in mind is that the care that has gone into preparing the test manual is often itself a good indicator of the care that has been exercised in constructing the test. Test authors who do a careful, thorough job of describing how the test was constructed and how evidence of its reliability and validity was obtained probably cared enough to give their very best. They are likely to have carried out the steps for constructing the test in a thoughtful, professional manner as well.
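Two of the quantities mentioned in items 5 and 6 above, the standard error of measurement and the reliability of a difference between two subscores, can be computed from figures a manual ordinarily reports. The sketch below uses the standard classical test theory formulas; the formula for the difference score assumes that both subscores are expressed on scales with equal standard deviations, and all numerical values are hypothetical.

```python
import math


def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """SEM = SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)


def difference_score_reliability(r_xx: float, r_yy: float, r_xy: float) -> float:
    """Reliability of the difference X - Y.

    r_xx and r_yy are the reliabilities of the two subscores and r_xy is
    their intercorrelation. Assumes equal standard deviations for X and Y.
    """
    return ((r_xx + r_yy) / 2.0 - r_xy) / (1.0 - r_xy)


# Hypothetical scale with SD = 15 and reliability .91.
print("SEM:", round(standard_error_of_measurement(15.0, 0.91), 2))

# Two hypothetical subscores, each with reliability .90.
for r_xy in (0.60, 0.80):
    r_diff = difference_score_reliability(0.90, 0.90, r_xy)
    print(f"intercorrelation {r_xy:.2f}: difference reliability = {r_diff:.2f}")
```

With these illustrative numbers, raising the intercorrelation of the subscores from .60 to .80 drops the reliability of their difference from .75 to .50, which is the sense in which differences become hard to interpret as the correlations approach the reliabilities.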
E-Testing
A recent innovation that can make test use much easier but that has some potential pitfalls is test administration over the Internet. The Psychological Corporation has developed a Web site they call the Psychological Corporation Assessment Center (http://www.PsychCorpCenter.com) where qualified professionals in assessment can order a variety of instruments for their clients. Clients are notified by e-mail that an assessment instrument has been ordered for them and they can log onto the Web site and complete the assessment at their convenience. The instrument is then scored and an interpretive report generated by the company’s software. The average time for the scoring process is claimed to be 3 minutes. The person who ordered the assessment is notified that the testing has been completed and can download the report. The potential time savings for both the examiner and the client could be substantial. At this writing, 32 instruments from the Psychological Corporation catalog are available in this format. Consulting Psychologists Press (CPP), another major test publisher, offers five instruments for e-testing, and we may expect other large test publishers to follow suit soon.
The Psychological Corporation has taken elaborate measures to ensure the security of the identity of the tester and examinee, the test responses, and the report generated from the e-test. This, of course, is essential for responsible test use and reduces the worry about test security. However, two additional caveats remain regarding Internet testing. The first, and less critical, potential problem is one of control of the testing environment and applicability of test norms. Norms for tests are developed in particular environments in which efforts are made to ensure that examinees are taking the testing seriously and responding in a conscientious and truthful manner. Until research has shown that responses obtained over the Internet correspond to those obtained using more traditional methods, we must treat with caution interpretations of Internet test scores based on norms obtained in other ways.
The second and more serious problem is that we have no way of knowing who is actually providing responses to the test if the testing is taking place in the privacy of one’s home. The Psychological Corporation attempts to control this problem by requiring the examinee to log on with identifiers provided by the person who ordered the test, but this cannot guarantee the identity of the person who actually fills out the test. Some of the tests included in the PsychCorpCenter list are ability measures that a company might use for employee selection. Examinees motivated by a desire to obtain employment could fill out the test with the help of friends or relatives, or could even have another individual take the test in their place. The only sure solution to this difficulty seems to be to have examinees take their tests using computers supplied by the testers at locations that can be monitored, but this eliminates much of the convenience that Internet test administration might offer. (In a related development, Petrill, Rempell, Oliver, and Plomin, 2002, have shown that it is possible to administer cognitive ability measures over the telephone. In this case, however, the examinees were children and a parent helped with test administration.)
GUIDE FOR EVALUATING A TEST
As an aid to the potential user, we end this section with a guide for evaluating a test. This guide consists of a series of questions, based in large part on the Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 1999). Careful study of the complete Standards will reward the person who would become a more sophisticated test user. (The most recent editions of the Standards differ from their predecessors in that they place more emphasis on the proper interpretation and use of test results rather than on test development. The validity of test scores depends heavily, some would say entirely, on the context of their use, and it is the responsibility of the test user to guarantee that inferences and decisions based on test scores are supported by appropriate evidence.)
You will note that many of the questions in this guide relate to the availability and adequacy of information reported about the test. There is an implied, although not explicitly stated, second question as a sequel to many of the sections of the guide, especially those relating to reliability and validity. That question is “Given that the information is provided, how satisfactory does the test appear to be, in comparison with others that are available, as well as by absolute standards, for the use I want to make of it?” A significant portion of the responsibility for valid testing rests with the test user. You must evaluate the evidence provided by the publisher, as well as what you know about the examinees and the intended uses of the test scores. In light of this information, you must decide whether the evidence justifies using this test for this purpose. A number of questions in this guide also refer to the adequacy of norms and converted scores. You may wish to review Chapter 3 as you study this portion of the guide.
General Identifying Information
1. What is the name of the test?
2. Who are its authors—by name and position if that information is available? (It should be.)
3. Who publishes the test? And when was it published?
4. Is more than one form of the test available? If so, how many?
5. How much does the test cost?
6. How long does it take to administer the test?
Information About the Test
1. Is there a test manual (or some other similar source of information, such as an article in a journal) that is designed to provide the information a potential user needs to administer the test and interpret the results properly?
2. How recently have there been revisions in the test, the manual, and the norms? (For major commercial tests it is reasonable to expect that both the test and its manual will be revised at least every 10 years and that new norms will accompany the revisions.)
Aids to Interpreting Test Results
1. Does the manual provide a clear statement of the purposes and applications for which the test is intended?
2. Does the manual provide a clear statement of the qualifications needed to administer the test and interpret its results properly?
3. Do the test, manual, record forms, and accompanying materials guide users toward sound and correct interpretation of the test results?
4. Are the manual’s statements expressing relationships presented in quantitative terms so that the reader can tell how much confidence to attach to them? (For example, it is much more informative to say that the correlation with Variable X is .55 than to say that the test correlates substantially with Variable X.)
Validity
1. Does the manual report evidence on the validity of the test for each type of inference for which the test is recommended?
2. Does the manual avoid referring to correlations between items and total test score as evidence of validity?
3. If the test is designed to be a sample of a specified domain of behaviors (e.g., an achievement test), does the manual define the domain clearly and indicate the procedures used for sampling from that domain?
4. When criterion-related validity is involved, does the manual describe the criterion variables clearly, comment on their adequacy, and indicate what aspects of the criterion performance are not adequately reflected in these criterion measures?
5. Are the samples used for estimating criterion-related validity adequately described? And are they appropriate to the purpose? One way spuriously to inflate apparent validity is to use samples that are much more heterogeneous than those that can be expected in actual use. Are such tricks avoided?
6. Are statistical analyses for criterion-related validity presented in a form that permits the reader to judge the degree of confidence that can be placed in inferences about individuals?
7. If the test is designed to measure a theoretical construct (e.g., trait of ability, temperament, or attitude), is the proposed interpretation clearly stated and differentiated from alternate theoretical interpretations? Is the evidence to support this interpretation clearly stated and fully presented?
8. Are potential threats to valid use of the test scores, such as language handicaps and learning disabilities, identified and appropriate cautions noted? Does the manual consider possible unintended outcomes of using the test?
In summary, to what extent does the available validity evidence justify the uses of the test suggested in the manual or the use that you would want to make of the test results? Remember that if you intend to use the test for some purpose other than that contemplated by the test developer, the responsibility for demonstrating the validity of the test for that use rests with you.
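The warning in item 5 about heterogeneous samples can be illustrated with a small simulation. The sketch below uses made-up data (a hypothetical test with mean 100 and standard deviation 15, and a hypothetical criterion that shares part of its variance with the test); it is only meant to show the direction of the effect, not to reproduce any real validity study.

```python
import random


def pearson_r(xs, ys):
    """Product-moment correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)  # fixed seed so the illustration is repeatable

# Hypothetical data: test scores and a criterion partly determined by them.
scores = [rng.gauss(100, 15) for _ in range(5000)]
criterion = [0.5 * (s - 100) / 15 + rng.gauss(0, 1) for s in scores]

# Heterogeneous sample: the full range of examinees.
r_full = pearson_r(scores, criterion)

# More homogeneous sample: only examinees scoring within half a standard
# deviation of the mean, as might happen in a selected group.
narrow = [(s, c) for s, c in zip(scores, criterion) if 92.5 <= s <= 107.5]
r_narrow = pearson_r([s for s, _ in narrow], [c for _, c in narrow])

print(f"full-range sample:       r = {r_full:.2f}")
print(f"restricted-range sample: r = {r_narrow:.2f}")
```

The same test and the same criterion yield a noticeably larger coefficient in the full-range group, which is why a manual's validity (and reliability) samples should resemble the group you intend to test.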
Reliability
1. Does the manual present data adequate to permit the reader to judge whether scores are sufficiently dependable for the recommended uses (or for your contemplated uses, if these are different from the recommended ones)?
2. Are the samples on which reliability data were obtained sufficiently well described that the user can judge whether the data apply to his or her situation? As with validity, one way to spuriously inflate apparent reliability is to use samples that are much more heterogeneous than those that can be expected in actual use. Are such tricks avoided?
3. Are the reliability data presented in the conventional statistical form of product moment correlation coefficients and standard errors of measurement? Are standard errors presented for different levels of performance?
4. If more than one form of the test was produced, are data provided to establish comparability of the forms?
5. If the test purports to measure a generalized homogeneous trait, is evidence reported on the internal consistency (interitem or interpart correlations) of the parts that make up the test?
6. Does the test manual provide data on the stability of test performance over time?
In summary, to what extent do the reliability data provided in the manual justify the uses for the test results suggested by the authors or the uses that you want to make of the test results?
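When a manual reports a reliability coefficient but not the standard error of measurement mentioned in item 3, the usual classical-test-theory relation, SEM = SD × √(1 − reliability), lets you estimate it yourself. The following is a minimal sketch; the function names and the example values (a scale with a standard deviation of 15 and a reliability of .91) are hypothetical and chosen only for illustration.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """Classical estimate: SEM = SD * sqrt(1 - reliability)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)


def score_band(observed, sd, reliability, z=1.0):
    """Band of +/- z SEMs around an observed score."""
    sem = standard_error_of_measurement(sd, reliability)
    return observed - z * sem, observed + z * sem


# Hypothetical scale: SD = 15, reported reliability = .91
sem = standard_error_of_measurement(15, 0.91)
low, high = score_band(110, 15, 0.91)
print(f"SEM = {sem:.1f}; band for an observed score of 110: {low:.1f} to {high:.1f}")
```

A single SEM computed this way is an average over the score range; it does not replace the separate standard errors for different performance levels that item 3 asks about.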
Administration and Scoring
1. Are the directions for administration sufficiently clear and fully stated so that the administrator will be able to duplicate the conditions under which the norms were established and the reliability and validity data were obtained?
2. If the test is administered by computer or provides computer-assisted administration, are the programs written to run on the equipment you have available? Is adequate documentation and technical support provided?
3. If the test can be scored locally, are the procedures for scoring set forth clearly and in detail, in a way that will maximize scoring efficiency and minimize the likelihood of scoring error? Are the directions for determining subscores clear? If scoring software is available, are adequate documentation and support provided?
4. If the test is to be scored by the publisher or a commercial scoring service, does the test scoring service provide for the accumulation of results over time to aid in preparing local norms and local validity evidence?
Scales and Norms
1. Are the scales used for reporting performance clearly and carefully described so that the test interpreter will fully understand them and be able to communicate the interpretation to an examinee?
2. Are norms reported in the manual in appropriate form—usually standard scores or percentile ranks in appropriate reference groups?
3. Are the populations to which the norms refer clearly defined and described? And are they populations with which you can appropriately compare your examinees?
4. If more than one form of the test is available, including revised forms, are tables provided showing equivalent scores on the different forms?
5. Does the manual discuss the possible value of local norms? Does it provide any help in preparing local norms?
6. Is a computer program available to assist with test interpretation and report writing? If so, is the program appropriately interactive; do the interpretations and reports it generates make suggestions about test results and avoid drawing definitive conclusions?
In summary, many factors should be considered when selecting a test, no one of which is conclusive. It will be your responsibility as a measurement professional to gather evidence related to using a particular test for a particular purpose. No test is perfect, and you will have to weigh the benefits of using each test against the possible costs and consequences. Only when you feel confident that the advantages outweigh the risks should you use the test.
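Where a manual offers little help with local norms, the basic computations are simple enough to do yourself. The sketch below, using a tiny made-up set of local scores, converts a raw score to a percentile rank and to a z-score. It assumes one common convention for percentile ranks (percentage scoring below, plus half of those at the same score); a particular publisher's norm tables may follow a different convention.

```python
def percentile_rank(score, norm_scores):
    """Percentage of the norm group below the score, counting half of any ties."""
    below = sum(1 for s in norm_scores if s < score)
    equal = sum(1 for s in norm_scores if s == score)
    return 100.0 * (below + 0.5 * equal) / len(norm_scores)


def z_score(score, norm_scores):
    """Standard score relative to the norm group's mean and standard deviation."""
    n = len(norm_scores)
    mean = sum(norm_scores) / n
    sd = (sum((s - mean) ** 2 for s in norm_scores) / n) ** 0.5
    return (score - mean) / sd


# Hypothetical local norm group of raw scores
local_norms = [10, 12, 14, 14, 16, 18, 20, 22]

print(percentile_rank(14, local_norms))       # percentile rank of a raw score of 14
print(round(z_score(14, local_norms), 2))     # same score as a z-score
```

In practice a local norm group would need to be far larger than this, and accumulated over several testings, before the resulting ranks were stable enough to use.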
GETTING INFORMATION ABOUT SPECIFIC TESTS
There were 2,846 published tests in English available for sale in 2006, an increase of 66 from 4 years earlier. In addition, many more than this number are out of print, were produced locally, or were developed for specific research projects. Many of these are available from one source or another. A single text cannot even begin to list, much less describe, all the instruments developed that might interest you. A few of the best known and most widely used tests will be reviewed in later chapters as illustrations of different types of tests. For the rest, we try in the remainder of this chapter to provide you with some tools to find the tests you need and information about them. We have organized the sections around the questions that you are likely to ask, in the order that you would ordinarily ask them, and we try to give useful ways to find answers to these questions.
What Tests Exist?
What tests of reading comprehension are suitable for fourth-graders? What tests of vocational interests exist for high school seniors? What measures of attitudes toward nuclear power are available? The first thing we need to know is what already exists. When we have a catalog of possibilities, we can begin to pick and choose. Or, if nothing satisfactory is available, we can undertake to construct something from scratch. Where should we go to assemble a catalog of possibilities?
The first source to turn to is Tests in Print VII (Murphy, Spies, & Plake, 2006). This catalog (also known as TIP-VII) is revised about every 5 years and provides an alphabetical listing of the 2,846 instruments that were confirmed as being available from commercial test publishers at the time of its release, with information on when each was published and by whom. An 18-category subject matter index is provided to help the reader locate the tests in a given category. There is also a list of out-of-print tests that were reviewed in some edition of the Mental Measurements Yearbook (see below), along with an index of test publishers, a list of test acronyms, a list of author names, and an index of score labels. Tests in Print is a companion volume to the Mental Measurements Yearbook series (hereafter referred to as MMY), which is described in a later section. Tests in Print includes references to entries in the MMYs dealing with the test in question, as well as updates of the bibliography of references relating to that test. Each edition of Tests in Print replaces the previous one, deleting entries for tests no longer available and adding references to new ones. Entries include the test name, purpose, intended population, publication date, acronym (if one exists), scores provided, type of administration (individual or group), price, testing time, authors, publisher, and a cross reference to MMY reviews.
Two companion sources to Tests in Print are Tests: A Comprehensive Reference for Assessments in Psychology, Education and Business (hereafter referred to as Tests) (Maddox, 1997) and the Dictionary of Behavioral Assessment Techniques (Dictionary) (Hersen & Bellack, 1988). The entries in these volumes are alphabetically arranged within topic areas and have an advantage over Tests in Print in that they provide a brief description of each entry, as well as information about the publisher and price. The descriptions are not evaluative; they closely follow the author’s or publisher’s statement of what the test measures. The information in Tests is somewhat sketchy, in that it does not identify the author, give the date of publication of an instrument, or give any statistical information about it. Tests contains about 2,500 entries for current tests and an index of out-of-print tests and their publishers. The Dictionary lists 288 clinical rating scales and questionnaires and does give the names and addresses of sources for the instruments.
Tests in Print, the Dictionary, and Tests are useful for finding tests, but they fall short in two respects. They do not give information on the most recent tests, and, except for some entries in the Dictionary and the out-of-print list in TIP-VII, they do not give information about unpublished instruments. (By unpublished instruments, we mean tests that are not offered for sale. An attitude scale that is given in full in a book or article is, for our purposes, unpublished unless its author offers it for sale.)
To find the most recently published tests, you need to obtain copies of test publishers’ current catalogs. They describe the tests and services that the publisher is currently promoting and even some expected to be available in the near future. A large number of publishers produce an occasional test, but relatively few are regular and substantial producers of testing instruments. A complete listing of test publishers, along with addresses and, in most cases, e-mail or Web site information, can be found in Tests in Print VII. A file of current test catalogs may sometimes be available through the measurement facility of your university or through the testing center in your local school system. If such a resource is not available, write to those publishers who have produced tests like the one you seek and request a copy of their current catalog. Some professional journals carry ads from test publishers, as do The Monitor, the newsletter of the American Psychological Association, and similar publications from other professional associations. Most of the larger publishers also maintain Web sites that contain information about their products. If you have the name of a test but do not have access to the references we have listed here, a Google search will often lead you to the publisher’s Web site or to an article or book describing the test.
Locating unpublished instruments that have been locally developed or that have been used only in research studies can be a bit tricky. However, several source books and directories have been prepared that provide assistance in such an undertaking. Probably the most useful of these sources is the test collection Tests in Microfiche, developed by Educational Testing Service (ETS). Beginning in 1975, ETS started distributing not only an index of unpublished tests but also copies of the tests on microfiche. By 2004, when the service was discontinued, the collection included more than 10,000 instruments. The ETS collection Tests in Microfiche is a service to which many university libraries subscribed, and your library may still be able to give you access to these materials. However, the subscription contains the following restriction (printed in the Annotated Index):
The materials included in the microfiche may be reproduced by the purchaser for his or her own use. Permission to use these materials in any other manner must be obtained directly from the author. This includes modifying or adapting the materials or selling or distributing them to others. (p. iii)
At this writing, the Buros Institute of Mental Measurements at the University of Nebraska–Lincoln provides easy access to the ETS collection by a link from their Web site at http://buros.unl.edu/buros/jsp/search.jsp, Test Reviews Online. A Google search on “ETS Test Collection” will also lead you there. The ETS Web page describes the collection as follows:
The Test Collection at ETS is a library of more than 25,000 tests and other measurement devices that makes information on standardized tests and research instruments available to researchers, graduate students, and teachers. Collected from the early 1900s to the present, the Test Collection at ETS is the largest such compilation in the world.
It is a valuable resource for anyone in search of information about what unpublished tests are available.
A number of other sources (now rather dated) also provide references to unpublished tests. The Directory of Unpublished Experimental Mental Measures (Goldman & Busch, 1978, 1982; Goldman & Mitchell, 1990; Goldman & Osborne, 1985; Goldman & Sanders, 1974) includes more than 3,000 entries describing such instruments, while Tests and Measurements in Child Development: Handbook II (Johnson, 1976) and Tests and Measurements in Child Development: A Handbook (Johnson & Bommarito, 1971) are other sources of unpublished instruments for use with young children. The Handbook of Tests and Measurement in Education and the Social Sciences (Lester & Bishop, 1997) lists about 80 relatively obscure instruments primarily related to education.
Several volumes cover the field of attitude measurement and reproduce the actual scales used (Robinson, Athanasiou, & Head, 1969; Robinson, Rusk, & Head, 1968; Robinson & Shaver, 1973; Shaw & Wright, 1967). More recently, O’Brien (1988) has published a bibliography of selected references on testing, which includes 2,759 sources categorized by the topic or subject covered in the article, and ETS has published a six-volume Test Collection Catalog listing (1) achievement tests, (2) vocational tests, (3) tests for special populations, (4) cognitive, aptitude, and intelligence tests, (5) attitude tests, and (6) affective measures and personality tests. In addition to the sources already mentioned, the Sources of Test Information list at the end of this chapter presents some references that cover more limited areas.
A newer resource that describes a variety of instruments for use with family counseling and research is the three-volume set of Handbook of Family Measurement Techniques. Volume 1 (Touliatos, Perlmutter, & Straus, 2001) and Volume 2 (Touliatos, Perlmutter, & Holden, 2001) contain abstracts of research reports using 168 instruments. Volume 3 (Perlmutter, Touliatos, & Holden, 2001) contains copies of the instruments, scoring instructions, and references to the source articles in which the instruments were first presented. If you need tests for use in a family counseling environment, this resource may prove useful.
An additional resource that may be useful in certain circumstances is the Directory of Selected National Testing Programs (Educational Testing Service, Test Collection, 1987). This three-volume series gives the names, publishers or producers, addresses, and general descriptions of a wide range of national testing programs. Volume 1 covers selection and admission programs for secondary and postsecondary institutions, government service, graduate and professional schools, and health-related programs. Volume 2 lists academic-credit and advanced-placement testing programs, and Volume 3 gives the testing programs for certification and licensing. The tests covered in this series are not generally available for public review.
Exactly What Is Test X Like?
Once you have identified a promising reading test, interest inventory, or attitude measure (called, say, Test X), you will want to find out more about it. Where should you turn?
A certain amount of descriptive information about the test will appear in some of the sources listing tests and in evaluations that appear in the MMYs. However, there is really no substitute for examining the test firsthand. So, the first thing to do, if possible, is to obtain a copy of the test and look at it. A specimen set, which usually includes a copy of the test, an answer sheet, and the manual for the test, may be available through your university, in either the library, counseling center, or measurement-area files for published tests. Or the testing office of your school system may be able to provide one. Another resource for locating copies of tests is the Directory of Test Collections in Academic, Professional, and Research Libraries (Fehrmann & O’Brien, 2001). This volume lists 77 libraries that allow outside professionals to use their facilities and have collections of at least 100 tests. If you cannot examine the test of your choice in one of these ways, you may have to order a specimen set from the publisher; the catalog will indicate the price.
A publisher is likely to require some credentials to show that you are an appropriate person to have access to the testing materials. A letter from your instructor or from a supervisor in the company or school system where you work may suffice. However, some types of instruments, such as personality inventories and individually administered cognitive ability tests, that call for special training or skills will have further restrictions on their distribution, and you may have to complete a form to verify that you have the required qualifications. In a few cases, it may be necessary to have a person with the required professional credentials order the test for you and supervise your examination of it.
Many of the unpublished instruments are reproduced in full in some of the compendia listed in the Sources of Test Information section at the end of this chapter, particularly in the Tests in Microfiche collection. Others are reproduced in full in articles reporting their development and use. If an unpublished test you wish to examine is not available from one of these sources, you may have to try to get a copy from its author.
Once you have a copy of the test or inventory, what information should you try to get from it? The answer depends on the type of test. If it is a test of school achievement, you should examine the items and ask yourself if the content covered and the processes called for match the objectives you have set for your teaching or that are contained in the school or district curriculum guide. For all kinds of tests, you should ask whether the items are clearly stated, the answer choices plausible, the directions clear, and the page layouts attractive and legible. Of course, for unpublished tests, the quality of the test materials may ultimately be up to you because you are likely to have to produce your own copies of the test after obtaining appropriate permission.
Of equal importance to the test itself are the supporting materials that describe the test’s psychometric properties, the form in which test results are to be reported, and the aids provided for test use and interpretation. For tests available from commercial publishers, this information should appear in one or more test manuals, possibly in a manual for test administrators or one for counselors or supervisors, and in a technical manual that gives, in considerable detail, the psychometric properties of the test. You should examine these manuals thoroughly.
For unpublished tests, the primary sources of technical material about the tests will be in articles or books reporting studies in which the tests were used. The quality of this information is likely to be much lower than that which you will find in the manual for a good commercial test. For relatively new tests or tests that have not been widely used, it is often necessary to contact the author. It is not uncommon to encounter a measuring instrument developed for use in a research study, for which there is no information about reliability or validity except for the findings from that one study. In some cases there may be no evidence at all of instrument quality. When this happens, the test should not be used other than for research purposes, and then only with caution.
Most commercially published tests provide accompanying scoring and reporting services. The services range from simply scoring the test answer sheets and reporting raw scores and percentiles or standard scores, to providing extended narrative interpretations of each individual’s test results. Examples of some of these reports were given in Chapter 3. The specimen set should describe, and perhaps illustrate, the types of reports that are available, and you should determine how adequately these will serve your needs.
The evidence on reliability and validity should be scrutinized with a particularly critical eye. Remember that the specimen set is primarily a promotional piece. You can expect the publisher to accentuate the positive, so try to cut through the puffery and get down to the basic evidence; be suspicious if evidence is incompletely or vaguely reported. Chapters 4 and 5 and the guide for test evaluation provided earlier in this chapter indicate what evidence you can reasonably expect to find.
What Do Critics Think of Test X?
Because materials from the test publisher focus on selling the test (some do so blatantly; some, subtly), it is highly desirable to get an evaluation by a competent and unbiased reviewer. The one source to automatically consult for such critical reviews is the Mental Measurements Yearbook series. This series was initiated by the late Oscar Buros in 1936 and published by him until 1978. Preparation of subsequent volumes in the series is in the hands of the Buros Institute of Mental Measurements at the University of Nebraska. At this writing, the Seventeenth Mental Measurements Yearbook (Geisinger, Spies, Carlson, & Plake, 2007) has been published and the Eighteenth can be expected soon. The MMYs provide reviews, by presumably competent and disinterested people, of each published test of any significance. In recent MMYs, the publishers have obtained at least two independent reviews of each new test. (See Thorndike, 1999b, for a thorough description and review of the Twelfth MMY and some cautions.)
The volumes of this series are cumulative. That is, a test is reviewed when it first comes out, and it is generally not reviewed again unless there has been a significant change in the test or the material supporting it, or unless it is a test of unusually widespread and continuing use. Tests in Print VII provides an index giving the volume and page numbers for reviews of tests still in print in the first 16 MMY volumes.
In 1988, the Buros Institute of Mental Measurements began a new schedule for producing the MMYs. That year saw the publication of a Supplement to the Ninth Mental Measurements Yearbook (Conoley, Kramer, & Mitchell, 1988), which gave full MMY treatment to 89 new tests that had been published since 1985. The Tenth Mental Measurements Yearbook was published in 1989, and its supplement appeared in 1990. The staff of the Buros Institute announced plans to publish a complete volume in alternate years thereafter, with supplements in the intervening years, but the Eleventh Mental Measurements Yearbook arrived in 1992, followed by its supplement in 1994. The Twelfth MMY was published in 1995; the Thirteenth MMY appeared in 1998, followed by a Supplement to the Thirteenth MMY in 1999; the Fourteenth MMY became available in 2001; the Fifteenth MMY appeared in 2003 and the Sixteenth MMY in 2005, so the Buros Institute seems to have achieved its goal. Beginning with the Eleventh MMY, the yearbooks have been available on CD-ROM. This format has the advantage of permitting automated searches, but it is more difficult to flip back and forth between two or more instruments to compare them directly unless you print the reviews in which you are interested. Recently, the biennial Supplement has been replaced by Test Reviews Online, which includes new reviews of tests as soon as they are completed and edited by the Buros Center staff.
The Buros Institute also maintains a Web site at http://www.unl.edu/buros/. This site has links to other testing-related Web sites, including the ERIC/AE test locator and the American Psychological Association (APA) site. It also has the ability to search for reviews of tests published in either the MMYs or Tests. A classified subject index covers all entries after 1985 (Ninth MMY), listing each reviewed test under one of 19 subject headings. The site does not contain copies of the reviews, but you can determine whether a test has been reviewed and either purchase a copy of the review or find the review if published.
Each of the 17 MMYs and the supplements includes tests of all types. The MMYs and Tests in Print should be available at the reference desk of any good university library, at many larger public libraries, and at the testing bureau in many school systems. The Buros Institute Web site has a link to the Silver Platter Information database where one can obtain copies of reviews for a fee, and one can order fax copies of individual reviews directly from the Buros Institute as well, also for a fee. The MMYs provide the most important source for evaluative reviews of tests. They also include reviews of books and monographs on testing, a listing of test publishers, and nearly complete bibliographies of published material on each of the tests.
Another source of critical reviews of tests is the 10-volume series Test Critiques (Keyser & Sweetland, 1984). Each volume in the series contains a single review of each of several hundred tests. The reviews are often somewhat longer and more detailed than those in the MMYs, but lack the contrasting opinions that multiple reviews provide. Each of the last eight volumes contains a subject index to all preceding ones, but within a single volume, the tests are arranged alphabetically by title without regard to topic. This system may cause difficulty in finding a specific test. However, a Web site with search capabilities is available at http://infotree.library.ohiou.edu/single-records/2535.html.
Brief reviews of some American tests and a large number of tests published in Great Britain may be found in Tests in Education: A Book of Critical Reviews (Levy & Goldstein, 1984). The tests are organized into six general categories: early development, language, mathematics, achievement batteries, general abilities, and personality and counseling. Reviews of new and important tests will occasionally be found, along with book reviews, in some psychological and educational journals such as the Journal of Psychoeducational Assessment. However, they appear sporadically and include the opinions of only one reviewer. There does not seem to be any journal that has a policy of systematically and regularly reviewing new published tests. At this time, the best source of evaluative information seems to be the MMYs and their supplements, particularly now that they are updated more regularly in print and on the computer database.
A different approach to test evaluation has been taken by Hammill, Brown, and Bryant (1992) in their A Consumer’s Guide to Tests in Print. This volume contains ratings of seven technical aspects of the quality of a number of published instruments and observations about several nontechnical features of these instruments, as well as an overall rating of quality. The ratings are provided for each subscale of multiscore tests, like the Wechsler scales and the Stanford-Binet, and are presented as: A, highly recommended; B, recommended; and F, not recommended. The tabular form is compact, but the organization of the scales by the function measured results in splitting up the subscales of the multiscore instruments so that one must consult several pages to get an overall impression of a particular test. Unfortunately, this resource is now quite dated.
A number of university libraries have also developed Web sites that can help you conduct searches for tests that interest you. One that covers many of the same sources we have described but also gives direct links to useful databases is provided by the library at the California State University at Sacramento at http://library.csus.edu/guides/rogenmoserd/educ/ttests.htm.
What Research Has Been Conducted on Test X?
What studies have been made of the test’s reliability? Of validity as a predictor of Z? Of influence of coaching on its scores? Of relation to measures of Y? For the major, commercially produced instruments, of course, some material will appear in the technical manual for the test. Manuals vary widely in the amount of statistical and research information that they report. Some of the better ones become almost a full-length book and report a wide range of analyses of the test’s reliability, correlations with other instruments, and predictive validity for academic or job criteria. The manual will also often include a bibliography, providing references to specific research studies. But, for widely used tests, the manual can hardly provide complete information; for example, in the course of its 95-year history, the Stanford-Binet Intelligence Scale has been used in more than 10,000 studies, and in only 70 years, even more have been carried out with the Minnesota Multiphasic Personality Inventory and the various Wechsler scales. It is also difficult for a manual to be up to date with recent studies. In fact, it is rare for a test publisher to bring out a revision of a test manual that does not correspond to a revision of the test itself, so even the manuals for the best commercially produced tests may be 5 to 10 years old.
As already noted, for published tests, very comprehensive bibliographies appear in the MMYs and Tests in Print. As is the case with the reviews of the tests themselves, these reference sources include only references to articles and books published since their last editions. Even so, for a number of the popular and widely used tests, the bibliography is so extensive, running to hundreds of entries, that it becomes almost unusable. However, these bibliographies permit you to scan through titles with the hope of identifying the ones that deal with the specific problem of interest.
The other main avenue for locating research on tests and testing is through the index and abstract services, which appear as monthly journals and are then combined into annual volumes. The ones most likely to be useful for a person interested in testing are Dissertation Abstracts International, the Education Index, and Psychological Abstracts. Dissertation Abstracts International has had an online computer search service since 1980, which has recently been extended to cover American dissertations back to 1861! Another resource is the Index to Tests Used in Educational Dissertations (Fabiano, 1989). This reference categorizes more than 50,000 tests alphabetically by title. The entries for each test include the type of examinees, the volume and page number of the abstract, and the author’s name.
Other, more general databases that include many references to tests and testing topics are the ProQuest Education and Social Science journal databases and the ERIC/AE and PsycINFO collections. ERIC/AE includes information about articles in education and related areas, often unpublished or published in obscure sources, while PsycINFO provides a computerized version of much of the material in Psychological Abstracts with directions on how to obtain a copy. ProQuest provides direct access to the article in question in text or PDF format. All of these services are available at many university libraries. The ERIC/AE and PsycINFO sources contain the complete citation for each article, a set of keywords, and an abstract of the study. They have the advantage that they can search the entire content of each entry for any desired word or phrase. Thus, if you search for a test title, the program will produce a list of all entries that include the name of that test in the title, abstract, or keywords. Searches can also be carried out by author or subject. The half hour or so that it takes to become familiar with these facilities will be time well spent.
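The kind of full-entry search described here can be pictured with a toy example. The sketch below is not how any of the named databases is actually implemented; the records, field names, and the `search_entries` function are all hypothetical, and it simply shows a case-insensitive phrase match across title, abstract, and keywords.

```python
def search_entries(entries, phrase):
    """Return entries whose title, abstract, or keywords contain the phrase."""
    needle = phrase.lower()
    hits = []
    for entry in entries:
        searchable = " ".join(
            [entry["title"], entry["abstract"], " ".join(entry["keywords"])]
        ).lower()
        if needle in searchable:
            hits.append(entry)
    return hits


# Hypothetical bibliographic records, invented for illustration only.
entries = [
    {
        "title": "Predictive validity of Test X for first-year grades",
        "abstract": "Scores were correlated with grade point average.",
        "keywords": ["predictive validity", "college admission"],
    },
    {
        "title": "Coaching effects on aptitude scores",
        "abstract": "Students were coached before taking Test X.",
        "keywords": ["coaching", "aptitude"],
    },
    {
        "title": "Reading comprehension in fourth grade",
        "abstract": "A survey of classroom assessment practices.",
        "keywords": ["reading comprehension"],
    },
]

matches = search_entries(entries, "test x")
for entry in matches:
    print(entry["title"])
```

Searching on the test title finds the second record only because the name appears in its abstract; a search on a trait such as "reading comprehension" would instead retrieve the third record.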
Finally, there is the burgeoning world of the Internet. There is no way to tell what will be available in a year or two, but it is clear that Web pages and databases, in addition to the ERIC/AE, APA, and Buros Institute sites, will be available for searching, as will the holdings of many major libraries. Most psychology and education journals have Web pages, as do all of the major professional testing organizations. Notices of Web page addresses are published frequently in the journals and newsletters of the professional organizations. Google is a particularly useful tool, except that it can produce so many hits that selecting the useful ones can be very time-consuming if the search topic is not very specific or the site is not one of the most popular. (A search for ETS Test Collection produced 1,100,000 hits, but the desired site came up as the first entry.)
In using any database, the secret of a successful search is a shrewd selection of keywords that serve to guide the search engine in identifying relevant entries. In some instances, preparers of the database provide a glossary of terms that the computer will recognize. It is the responsibility of the user to generate—or to select from the glossary—the entries most likely to elicit relevant items from the database. Although test titles are unlikely to appear as keywords and may not appear in the abstract, the general traits measured by the instrument may well give access to publications in which the test is referenced.
Generally, a person who wishes to search one of the databases will find it desirable, and perhaps necessary, to work through the reference division of a university or public library. The library is likely to have a computer terminal with direct access to the data files in question. If not, the reference librarian should be able to help arrange such access. There may be a charge for the search, and you will want to get an estimate of any costs before starting this undertaking. Google, of course, is available free to anyone with access to the Internet and may eventually come to replace the more specific databases.
SUMMARY
In addition to yielding information that is valid for the proposed use, the test must be practical to use. Practicality relates to matters of expense, ease of administration and scoring, and ease with which appropriate inferences can be derived from the test scores, including the adequacy and availability of norm- or criterion-referenced interpretations.
When reviewing a test, it is important to examine all the factors that make for a quality instrument. The test manual or other sources should provide information about the purpose of the test, its reliability and validity, what scales the test provides and what norms are available, and how the test should be administered and scored. When considering which of several tests to use, you will also want to obtain information about what aids are available to assist in score interpretation.
There are several published sources of information about specific tests, including Tests in Print VII, Tests, and the Dictionary of Behavioral Assessment Techniques. Copies of unpublished tests can be obtained from Tests in Microfiche, the Directory of Unpublished Experimental Mental Measures, and several other more specialized sources. Critical reviews of tests can be found in the several editions of the Mental Measurements Yearbook and the multi-volume Test Critiques, as well as in various professional journals and newsletters.