Reliability and Validity
Differences between CRTs and NRTs
Many educators and members of the public fail to grasp the distinctions between criterion-referenced and norm-referenced testing. It is common to hear the two types of testing referred to as if they serve the same purposes or share the same characteristics. Much confusion can be eliminated if the basic differences are understood.
The following is adapted from: Popham, W. J. (1975). Educational evaluation. Englewood Cliffs, NJ: Prentice-Hall.
CRTs versus NRTs
Dimension: Purpose
  Criterion-Referenced Tests: To determine whether each student has achieved specific skills or concepts; to find out how much students know before instruction begins and after it has finished.
  Norm-Referenced Tests: To rank each student with respect to the achievement of others in broad areas of knowledge; to discriminate between high and low achievers.

Dimension: Content
  Criterion-Referenced Tests: Measures specific skills which make up a designated curriculum. These skills are identified by teachers and curriculum experts. Each skill is expressed as an instructional objective.
  Norm-Referenced Tests: Measures broad skill areas sampled from a variety of textbooks, syllabi, and the judgments of curriculum experts.

Dimension: Item Characteristics
  Criterion-Referenced Tests: Each skill is tested by at least four items in order to obtain an adequate sample of student performance and to minimize the effect of guessing. The items which test any given skill are parallel in difficulty.
  Norm-Referenced Tests: Each skill is usually tested by fewer than four items. Items vary in difficulty, and items are selected that discriminate between high and low achievers.

Dimension: Score Interpretation
  Criterion-Referenced Tests: Each individual is compared with a preset standard for acceptable achievement; the performance of other examinees is irrelevant. A student's score is usually expressed as a percentage, and achievement is reported for individual skills.
  Norm-Referenced Tests: Each individual is compared with other examinees and assigned a score, usually expressed as a percentile, a grade-equivalent score, or a stanine. Achievement is reported for broad skill areas, although some norm-referenced tests do report achievement for individual skills.
Reliability and Validity

For assessments to be sound, they must be free of bias and distortion. Reliability and validity are two concepts that are important for defining and measuring bias and distortion.
Reliability

Reliability refers to the extent to which assessments are consistent. Just as we enjoy having reliable cars (cars that start every time we need them), we strive to have reliable, consistent instruments to measure student achievement. Another way to think of reliability is to imagine a kitchen scale. If you weigh five pounds of potatoes in the morning, and the scale is reliable, the same scale should register five pounds for the potatoes an hour later (unless, of course, you peeled and cooked them). Likewise, instruments such as classroom tests and national standardized exams should be reliable: it should not make any difference whether a student takes the assessment in the morning or afternoon, one day or the next.
Another measure of reliability is the internal consistency of the items. For example, if you create a quiz to measure students’ ability to solve quadratic equations, you should be able to assume that if a student gets an item correct, he or she will also get other, similar items correct. The following table outlines three common reliability measures.
Type of Reliability: Stability or Test-Retest
  How to Measure: Give the same assessment twice, separated by days, weeks, or months. Reliability is stated as the correlation between scores at Time 1 and at Time 2.

Type of Reliability: Alternate Form
  How to Measure: Create two forms of the same test (vary the items slightly). Reliability is stated as the correlation between scores on Test 1 and Test 2.

Type of Reliability: Internal Consistency (Alpha, α)
  How to Measure: Compare one half of the test to the other half, or use methods such as the Kuder-Richardson Formula 20 (KR-20) or Cronbach's alpha.
The values for reliability coefficients range from 0 to 1.0. A coefficient of 0 means no reliability and 1.0 means perfect reliability. Since all tests have some error, reliability coefficients never reach 1.0. Generally, if the reliability of a standardized test is above .80, it is said to have very good reliability; if it is below .50, it would not be considered a very reliable test.
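To make these ideas concrete, here is a minimal sketch in Python (assuming NumPy is available; the item scores below are invented purely for illustration) that computes a test-retest correlation and Cronbach's alpha for a small set of quiz scores:

```python
import numpy as np

# Hypothetical data: 6 students take the same 5-item quiz twice
# (Time 1 and Time 2). Items are scored 0 (wrong) or 1 (right).
time1 = np.array([
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0],
    [1, 1, 0, 0, 1],
    [1, 1, 1, 1, 0],
])
time2 = np.array([
    [1, 1, 1, 1, 1],
    [1, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 1, 0, 1, 0],
    [1, 1, 0, 1, 1],
    [1, 1, 1, 0, 1],
])

# Test-retest reliability: the correlation between each student's
# total score at Time 1 and at Time 2.
totals1, totals2 = time1.sum(axis=1), time2.sum(axis=1)
test_retest = np.corrcoef(totals1, totals2)[0, 1]

def cronbach_alpha(items):
    """Internal consistency for an (examinees x items) score matrix:
    alpha = k/(k-1) * (1 - sum of item variances / variance of totals)."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_variances.sum() / total_variance)

print(f"Test-retest reliability: {test_retest:.2f}")
print(f"Cronbach's alpha (Time 1): {cronbach_alpha(time1):.2f}")
```

The printed coefficients can then be read against the rough benchmarks above (.80 or higher is very good; below .50 is weak); with real assessments, you would of course compute them over a much larger group of examinees.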
Validity

Validity refers to the accuracy of an assessment -- whether it measures what it is supposed to measure. Even if a test is reliable, it may not provide a valid measure. Let's imagine a bathroom scale that consistently tells you that you weigh 130 pounds. The reliability (consistency) of this scale is very good, but it is not accurate (valid) because you really weigh 145 pounds (perhaps you re-set the scale in a weak moment)! Since teachers, parents, and school districts make decisions about students based on assessments (such as grades, promotions, and graduation), the validity of the inferences drawn from those assessments is essential -- even more crucial than the reliability. Also, if a test is valid, it is almost always reliable.
There are three ways to measure validity. To have confidence that a test is valid (and therefore the inferences we make based on the test scores are valid), all three kinds of validity evidence should be considered.
Type of Validity: Content
  Definition: The extent to which the content of the test matches the instructional objectives.
  Example/Non-Example: A semester or quarter exam that only includes content covered during the last six weeks is not a valid measure of the course's overall objectives -- it has very low content validity.

Type of Validity: Criterion
  Definition: The extent to which scores on the test agree with (concurrent validity) or predict (predictive validity) an external criterion.
  Example/Non-Example: If the end-of-year math tests in 4th grade correlate highly with the statewide math tests, they would have high concurrent validity.

Type of Validity: Construct
  Definition: The extent to which an assessment corresponds to other variables, as predicted by some rationale or theory.
  Example/Non-Example: If you can correctly hypothesize that ESOL students will perform differently on a reading test than English-speaking students (because of theory), the assessment may have construct validity.
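As a brief illustration of how concurrent validity evidence like the 4th-grade example above might be gathered, here is a Python sketch (the scores and variable names are hypothetical) that correlates classroom test scores with the statewide test serving as the external criterion:

```python
import numpy as np

# Hypothetical scores for eight 4th-grade students: an end-of-year
# classroom math test and the statewide math test (external criterion).
classroom_scores = np.array([72, 85, 90, 64, 78, 88, 95, 70])
statewide_scores = np.array([68, 82, 91, 60, 75, 85, 97, 73])

# Concurrent validity evidence: correlation between the two sets of scores.
validity_coefficient = np.corrcoef(classroom_scores, statewide_scores)[0, 1]
print(f"Concurrent validity coefficient: {validity_coefficient:.2f}")
```

A coefficient close to 1 would support a claim of high concurrent validity for the classroom test.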
So, does all this talk about validity and reliability mean you need to conduct statistical analyses on your classroom quizzes? No, it doesn't. (Although you may, on occasion, want to ask one of your peers to verify the content validity of your major assessments.) However, you should be aware of the basic tenets of validity and reliability as you construct your classroom assessments, and you should be able to help parents interpret scores for the standardized exams.
Reflect on the following scenarios
Scenario 1
A parent called you to ask about the reliability coefficient on a recent standardized test. The coefficient was reported as .89, and the parent thinks that must be a very low number. How would you explain to the parent that .89 is an acceptable coefficient?
Scenario 2

Your school district is looking for an assessment instrument to measure reading ability. They have narrowed the selection to two possibilities -- Test A provides data indicating that it has high validity, but there is no information about its reliability. Test B provides data indicating that it has high reliability, but there is no information about its validity. Which test would you recommend? Why?
Grading Guidelines
Component: Accuracy -- answer to each scenario is well-formed and accurate.
  Unacceptable (0 points): Both responses are inaccurate, indicating a lack of understanding of the concepts.
  Acceptable (5 points): Response to one of the two scenarios is inaccurate, showing limited understanding of the concept.
  Target (10 points): Complete, well-formed, and accurate responses to both scenarios that show a thoughtful, reflective, in-depth understanding of the concepts.

Component: Evidence -- evidence from the text and the provided document is cited.
  Unacceptable (0 points): Failed to support responses to both scenarios with evidence from the text or the provided document.
  Acceptable (5 points): Response to one of the two scenarios is not supported by accurate and specific evidence cited from the text or the provided document.
  Target (10 points): Each response is supported by accurate and specific evidence cited from the text or the provided document.

Component: Professional Presentation -- spelling, grammar, mechanics.
  Unacceptable (0 points): Writing involves many grammatical errors (more than 3).
  Acceptable (2 points): Writing involves few grammatical errors (no more than 2).
  Target (5 points): Writing is free of all writing errors.