Week 2 -- PPOL-505 Exercise 1
We live in a world where information is being increasingly quantified. Reviewers rate movies with stars, newspapers report the cost of living in the “world’s most expensive cities,” and international watchdog organizations rank countries from the most corrupt to the least corrupt. We refer to this process of quantification as measurement. Consider the importance of measurement both in everyday life and to organizations. People use reviewer ratings to decide what movie to see. Tourists use cost-of-living reports to decide what cities to visit. Businesses use information on corruption to decide what international businesses to engage as partners. Organizations use data to identify community needs, to track performance, to evaluate programs, or to learn about clients. An organization may report data to influence policy decisions or to demonstrate accountability. In some cases, such as identifying community needs, the organization may use existing data. In other cases it will develop its own measures and collect the data.
In this chapter we take you through the steps required to decide what you want to measure and determine the quality of measures you use. Later chapters build on this knowledge. Chapter 4 introduces performance monitoring; you will learn strategies for identifying measures to describe individual and organizational performance. Chapter 5 has a brief section on levels of measurement, which is linked to question wording and the choice of statistics. In Chapter 7 we turn to writing questions and questionnaires. Whether you conduct a survey or design forms, the specific wording of questions and items directly affects the quality of your measures and the value of your data.
CONCEPTUAL AND OPERATIONAL DEFINITIONS
As we begin to develop measures we may label a variable as a concept and the statement that describes what we mean by the concept as a conceptual definition. Let’s begin with a homey example. What defines a “good restaurant”? Is it one that serves large portions for a relatively low price? Is it one that has interesting, even cutting-edge, food? Or do you look for atmosphere? If you think about your experiences in asking for restaurant recommendations you will understand important characteristics of a conceptual definition. That is, there is no one correct definition of a concept. Whether a definition is appropriate or not largely depends on the users and how they plan to use the information.
The first step is to learn how critical stakeholders define what will be measured and how they plan to use the information. If an adult literacy program manager asked you to help document its effectiveness you might first search for a credible definition of adult literacy. National assessments of literacy use the following conceptual definition: “The ability to use printed and written information to function in society, to achieve one’s goals, and to develop one’s knowledge and potential,” including the ability to read prose, handle forms, and perform arithmetic calculations. 1 The measures based on this definition have been used to track adult literacy in the United States and other countries, but this definition may be of limited use to a specific literacy program. Next you might meet with program staff and ask them what they want to achieve. You could ask staff members how they define literacy, how program activities should add to client literacy, and how they know that a program has been successful. You might ask them how their definitions compare with what clients, funders, and donors expect the program to achieve. Not all definitions are relevant to your study. Definitions, such as those used in national assessments, may help the staff organize its thinking, but they may not measure what a specific program is trying to achieve.
Assume that the staff agrees that they define literacy as “the parents’ ability to read to their children and engage them in conversations about the stories.” You would then find a way to measure the parents’ ability to read the stories, understand the stories, and talk about the stories. Let’s just focus on the ability to read stories and understand them. You might present the clients with a few very short stories, ask them to read the stories aloud, and ask a few questions about each story. You would develop a guide to score each client’s ability to read the story and answer the questions. The short stories, the questions asked, and the scoring guide constitute an operational definition of literacy. In other words, the operational definition gives you the complete picture of how clients’ literacy was determined. The stories included in the operational definition may seem appropriate given the criteria of word length and sentence structure. But a problem occurs if the stories are about unfamiliar objects. A story about farm life may confuse readers with no knowledge of farm buildings, crops, and various animals. To avoid such problems you must conduct a pretest. You should ask persons similar to the program’s clients to read the stories and answer the questions. The importance of conducting a pretest is a message that can’t be repeated often enough.
Honestly, in the real world, you will seldom encounter the term conceptual definition, but you should not ignore its importance. It is the same as asking, “What do we mean by X [the concept or term of interest]?” Too often, once an operational definition is stated people focus on its technical details. They may labor over the wording of the questions and responses. They may not stop to consider if they are measuring what they want to measure.
Let’s work through another example focusing on how an organization can evaluate its volunteer orientation program. First, the staff should decide what it wants to accomplish. Should the orientation introduce potential volunteers to the organization, its mission, its history, and its values? Will participants be trained to carry out specific tasks? Are participants expected to sign up as volunteers and recruit other volunteers? Once the goals are decided they are incorporated into a conceptual definition. The conceptual definition serves as a blueprint for deciding what to ask participants at the end of an orientation. The shaded box below uses orientation quality as an example of a conceptual definition and an operational definition and to illustrate their relationship.
Measuring the Quality of Volunteer Orientation
Variable: Perceived Quality of Volunteer Orientation
Conceptual Definition: Orientation attendees understand the organization, value its services, are motivated to volunteer, and feel prepared to work with clients.
Operational Definition, Part I
Ask each participant to fill out a form that includes the following items:
For each statement indicate the response that describes your opinion:
1. I understand the mission of [name of organization].
Responses: Strongly agree, agree, neither agree nor disagree, disagree, strongly disagree
2. I can describe the services [name of organization] offers.
Responses: Strongly agree, agree, neither agree nor disagree, disagree, strongly disagree
3. I feel qualified to refer clients to [name of organization].
Responses: Strongly agree, agree, neither agree nor disagree, disagree, strongly disagree
4. I am comfortable with recruiting others to volunteer with [name of organization].
Responses: Strongly agree, agree, neither agree nor disagree, disagree, strongly disagree
5. I have signed up to be a volunteer with [name of organization].
Responses: Strongly agree, agree, neither agree nor disagree, disagree, strongly disagree
6. I feel uncomfortable about working with clients.
Responses: Strongly agree, agree, neither agree nor disagree, disagree, strongly disagree
Operational Definition, Part II
Responses to Questions 1 through 5 are scored 5 = strongly agree, 4 = agree, 3 = neither agree nor disagree, 2 = disagree, 1 = strongly disagree. Question 6, which is negatively worded, uses the opposite scoring: 1 = strongly agree through 5 = strongly disagree. The sum of the six questions represents the perceived quality of orientation. The scores can range from 30 (the highest quality) to 6 (the lowest).
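The scoring rule in Part II can be sketched in a few lines of Python. This is only an illustration: the function name score_form and the dictionary layout are hypothetical, and the sketch assumes that the negatively worded item 6 is the one that receives the opposite scoring.

```python
# Hypothetical helper that scores one completed orientation form.
SCALE = {
    "strongly agree": 5,
    "agree": 4,
    "neither agree nor disagree": 3,
    "disagree": 2,
    "strongly disagree": 1,
}
# Assumption: item 6 ("I feel uncomfortable...") is the reverse-scored item.
REVERSED_ITEMS = {6}


def score_form(responses):
    """Return the total score for a dict mapping item number (1-6) to a response."""
    if set(responses) != set(range(1, 7)):
        raise ValueError("expected responses to items 1 through 6")
    total = 0
    for item, answer in responses.items():
        value = SCALE[answer.strip().lower()]
        if item in REVERSED_ITEMS:
            value = 6 - value  # 5 becomes 1, 4 becomes 2, and so on
        total += value
    return total


# The most favorable form agrees strongly with items 1-5 and disagrees strongly with item 6.
best_form = {1: "Strongly agree", 2: "Strongly agree", 3: "Strongly agree",
             4: "Strongly agree", 5: "Strongly agree", 6: "Strongly disagree"}
print(score_form(best_form))
```

Reversing the negatively worded item before summing keeps a high total pointing in one direction, toward a favorable view of the orientation.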
An operational definition determines what we actually measure. In the previous example we learned the perceptions of people who attended orientation. We do not learn if their understanding of the organization and its programs is accurate. We do not know if they are really prepared (or unprepared) to work with clients. We may not know if they actually signed up or actually volunteered. To get that information we would use a different operational definition. For some items we might ask participants specific questions about the organization, its services, and volunteer tasks.
Let’s stop here for a word of caution about the pragmatism of operational definitions. You may be tempted to use questions to evaluate volunteer orientation programs that you find on the Internet or obtain from a friend. But a measure that has been designed to assess a general orientation for a large nonprofit with diverse programs, such as the Red Cross or YMCA, might not be appropriate to assess an orientation for volunteers who will work with survivors of domestic violence. Even if a measure has documentation establishing its quality, this does not mean it is the right measure for your study. You need to consider your purpose and the characteristics of the individuals or organizations who will supply the requested information.
Before you put a question on a survey or interview guide you should establish its quality. First, you may ask, “Is the measured difference between subjects a real difference?” or “Did measured changes over time really occur?” These questions address the reliability of a measure. Second, you may ask, “Does this measure actually produce information on the concept or variable of interest?” This question addresses the operational validity of a measure. Third, you may ask, “Is this measure sufficiently precise?” This question addresses its sensitivity.
Reliability evaluates the consistency of a measure. Differences over time or between subjects may be due to random error. Random errors are just that, random events or features that affect your findings. Random errors occur because of respondent characteristics, the measure itself, or the process of assigning values. Uninterested or distracted respondents introduce errors when they answer questions rapidly without listening or thinking. Respondents may be inconsistent as they answer items with ambiguous terms or inadequate or unsuitable response choices. Raters may be inconsistent in how they assign values or record answers. Random errors cannot be entirely eliminated, but you want to make sure that they are not undermining the value of your data. A measure that yields a large amount of random error should be discarded.
Dimensions of Reliability: Reliability has three dimensions. A reliable measure should have stability, equivalence, and internal consistency. A stable measure yields different results over time only if the phenomenon being measured has changed. Consider choosing scholarship recipients. To ease its work and assure a fair selection process, the selection committee may create a measure to rate the applications. For each application a rater adds up the values of separate items that measure experience, leadership, achievements, and potential future contributions. The assigned values or ratings should not change with the rater’s mood, fatigue, or degree of attention. To make sure the ratings are consistent the raters can re-rate a random sample of applications. If their original ratings and second ratings are the same or nearly the same, the measure may be deemed stable.
An equivalent measure yields different results only if the differences between subjects are real differences. For example, if you and some colleagues use the same measure all of you should assign the same value to the same person or case. If several people are to rate the scholarship applicants they should first rate a sample of applications. The sample should represent a variety of applications, including applicants with diverse backgrounds or applications with wordy answers. If the raters give the same or nearly the same ratings to each application we can assume that the differences between subjects are real differences and not due to differences in the raters. If the raters give different scores the measure has to be examined to identify and resolve problems.
Equivalence also applies if two or more versions of a measure are used. Consider written driving tests. If each applicant takes the same test, cheating becomes a problem, so numerous versions are created. An equivalent measure would ensure that comparable test takers receive the same score no matter which test version they take. Developing alternate versions, however, is time consuming and rarely done except when designing a test for large numbers of people.
Internal consistency applies to measures with multiple items. It establishes if the items are empirically related to each other. A simple example may help explain internal consistency. Imagine a measure of arithmetic skills with five multiple-choice questions. The questions are
1. 35 + 83 =?
2. 83 – 35 =?
3. 83/23 =?
4. 1.83 × 23 =?
5. 5e3 =?
Let us assume that items 1 to 4 relate to arithmetical ability and that item 5 does not. Respondents who have good computational skills may easily answer the first four items, and guess the answer to item 5. Their guesses represent random error. If item 5 was dropped the measure would be more internally consistent.
Qualitative Evidence of Reliability: The next question to ask yourself is, “How much random error should I tolerate?” Think about the examples we have used in this chapter—adult literacy, quality of volunteer orientation, and scholarship ratings. Which measure should have the least random error? Why did you choose this measure? To decide how much random error to tolerate, you should consider its consequences. Among our three examples we would want to have the least error in choosing scholarship recipients, since the reliability of the measure will affect who receives the scholarship and who doesn’t. The decisions based on the other two measures will neither benefit nor harm an individual. Now think about driving tests. Given the large number of drivers and the potential of an unsafe driver to cause great harm to others the need for minimum random error is even greater.
To estimate reliability start with a qualitative approach. Qualitative methods do not precisely estimate the amount of random error. Hence, you may underestimate their value. However, at a relatively low cost, they can identify problems that will thoroughly discredit a poor measure. You should review a measure’s operational definition to decide whether
■ terms have been defined precisely;
■ ambiguous items or terms have been eliminated;
■ information is accessible to respondents;
■ multiple-choice responses cover all probable responses;
■ directions are clear and easy to follow.
A few examples will demonstrate the problems introduced if a measure lacks one or more of these characteristics. Recall that an unreliable measure does not detect actual differences between subjects or in the same subject over time. Before you ask individuals if they are homeless or ask an agency how many of its clients are homeless you need to define homeless. Homeless persons may include persons living on the streets, in a shelter, doubled up with relatives or friends, or couch surfing. Unless you have defined the term some people who are living with relatives “until something comes along” will say they are homeless and others won’t. The same may be true of young people who are sleeping on friends’ sofas. If you do not define what you mean by homeless, differences between people who say they are homeless and those who don’t may not be actual differences but differences in how they interpreted the question.
Gathering data about family size or earnings may seem straightforward, but both are fraught with ambiguities. If you ask “What is the size of your family?” you may be asked if half-brothers and sisters should be counted. What about divorced parents or siblings who live in different households? What about relatives or their partners who are part of the household? Similarly, asking members of a diverse population how much they earn is tricky. If a time period is not indicated respondents may report hourly, weekly, monthly, or yearly earnings. A person who reports “$12,000” may be reporting her monthly or her annual earnings. On the other hand, seasonal workers and self-employed persons may not know how much they earn in a year. Similarly, estimates of family income may be wildly inaccurate depending on which family member is asked.
To reduce the random errors and increase reliability you should review the definition of terms, the clarity of items, the accessibility of information, and the appropriateness of response choices. Your review of the operational definition should involve comments from potential respondents. Otherwise you risk falsely assuming that actual respondents will interpret the items the same way you do. Remember, don’t be presumptuous. Spell out what you want to know.
The reliability of measures also depends on interviewers and data entry staff. They must be trained and supervised to minimize inconsistent decisions. For example, all interviewers should define family the same way and use the same procedures for resolving problems of inaccessible information or ambiguous questions. Similarly, staff entering data on a spreadsheet may be uncertain how to handle forms on which respondents checked more than one response, embellished a response category with their own comments, or added an alternative response. If individuals decide on a case-by-case basis how to handle ambiguous responses, the decisions may be inconsistent and the data less reliable.
Quantitative Evidence of Reliability: All measures should receive a qualitative review before they are used to collect data. In addition mathematical procedures may be used to estimate reliability. Specific tests estimate the measurement error associated with stability, equivalence, and internal consistency. We will limit our discussion to tests that you may find useful and easy to implement and ones that are commonly reported in research articles. If you plan to construct or work extensively with job tests, achievement tests, or personality tests you need to become familiar with other methods for mathematically estimating reliability.
Test–retest establishes a measure’s stability. You should conduct a test-retest before collecting data. You need to collect the data at two points in time, and you should expect that what is being measured will not have changed. Let’s consider the scholarship selection committee. To establish a measure’s stability you may ask the committee to rerate some of the applications. If the raters give the same scores, you cannot automatically assume that the rating measure is reliable. You have to verify that raters did not remember and reproduce their original ratings. If the scores are different, you cannot automatically assume that the instrument is unreliable either. After the first round of ratings the raters may have discussed the measure and changed how they valued the various items. The information on the applications will not have changed; the perceptions and ratings of the raters changed. Using test-retest to establish stability is difficult insofar as an instrument itself may be the source of a change in values.
Inter-rater reliability establishes the equivalence of a measure. Inter-rater reliability should be established if two or more people are collecting data. Here, observers apply the measure to the same phenomena and independently record the scores; next, their scores are compared. Our example of several people rating the scholarship applicants and comparing their ratings illustrates inter-rater reliability. You may apply inter-rater reliability to staff training. Imagine training housing inspectors. Trainees may inspect and rate a sample of dwellings. The trainee may be approved to work once her ratings agree with the ratings of experienced inspectors.
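One simple way to compare two raters is to count how often their scores match. The sketch below is an illustration only: the function name agreement_rates, the one-point tolerance, and the ratings themselves are invented, and percent agreement is just one of several possible indicators of inter-rater reliability.

```python
def agreement_rates(ratings_a, ratings_b, tolerance=1):
    """Return (share of exact matches, share of ratings within `tolerance` points)."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must score the same non-empty set of cases")
    pairs = list(zip(ratings_a, ratings_b))
    exact = sum(a == b for a, b in pairs) / len(pairs)
    close = sum(abs(a - b) <= tolerance for a, b in pairs) / len(pairs)
    return exact, close


# Invented ratings of eight dwellings on a 1-5 scale.
trainee = [4, 3, 5, 2, 4, 1, 3, 5]
experienced = [4, 3, 4, 2, 5, 1, 3, 3]

exact, close = agreement_rates(trainee, experienced)
print(f"exact agreement: {exact:.1%}, within one point: {close:.1%}")
```

How much agreement is enough is a judgment call that depends on the consequences of the ratings, as discussed earlier for tolerable random error.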
Tests of internal consistency establish the homogeneity of a measure. Internal consistency is used only if a measure consists of several items. The key statistic to assess internal consistency is Cronbach’s alpha. The closer alpha is to 1.0 the more internally consistent the measure. If alpha is close to 0.0, the items are unrelated to one another. As is true with many statistics, there are no set criteria for an acceptable alpha. When a research report indicates a measure’s reliability it normally refers to a test of internal consistency.
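For readers who want to see how alpha is computed, the sketch below uses the standard formula: alpha = k/(k - 1) × (1 - sum of the item variances / variance of the total scores), where k is the number of items. The function name cronbach_alpha and the response data are invented for illustration; statistical packages report the same statistic.

```python
from statistics import variance


def cronbach_alpha(rows):
    """rows: one list of item scores per respondent, all of the same length."""
    k = len(rows[0])
    if k < 2 or len(rows) < 2:
        raise ValueError("need at least two items and two respondents")
    item_variances = sum(variance(column) for column in zip(*rows))
    total_variance = variance([sum(row) for row in rows])
    if total_variance == 0:
        raise ValueError("total scores do not vary; alpha is undefined")
    return k / (k - 1) * (1 - item_variances / total_variance)


# Invented data: five respondents, four items. The last item behaves
# differently from the first three, much like item 5 in the arithmetic example.
responses = [
    [5, 4, 5, 1],
    [4, 4, 4, 5],
    [2, 2, 3, 3],
    [1, 2, 1, 4],
    [3, 3, 3, 2],
]
alpha_all_items = cronbach_alpha(responses)
alpha_without_last = cronbach_alpha([row[:3] for row in responses])
print(round(alpha_all_items, 2), round(alpha_without_last, 2))
```

In this invented data set, dropping the unrelated item raises alpha, which is the pattern the arithmetic example describes.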
At a minimum you want to verify that the data are free of gross errors. To do so, start with a qualitative review of a measure’s operational definition. The review will identify serious threats to reliability. Depending on how a measure will be used you may consider a quantitative test as well. On reading research reports you are most likely to see references to inter-rater reliability or internal consistency.
Operational validity produces evidence that a measure is correctly named and accurately describes what it is measuring. Let’s consider poverty. In the United States a household is under the poverty threshold if its income is less than three times the cost of a minimally adequate diet. The multiplier of three comes from a 1955 finding that a typical family spent one-third of its income on food. The challenge of measuring poverty illustrates several important points. First, operational validity depends on planned use. A measure of poverty may be used to track changes in poverty over time, determine who should receive financial assistance, or describe the effect of poverty on individuals and their community. The poverty threshold has the advantage of being reliable, and changes from year to year may be assumed to be real changes. It may not, however, be a fair criterion to decide if a household needs financial assistance.
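The threshold rule described above reduces to a single multiplication. The sketch below only illustrates that arithmetic: the function names and the dollar figures are invented, and official thresholds involve adjustments (for family size, for example) that are not shown here.

```python
FOOD_MULTIPLIER = 3  # from the finding that food took about one-third of income


def poverty_threshold(minimal_diet_cost):
    """Threshold implied by the rule: three times the cost of a minimal diet."""
    return FOOD_MULTIPLIER * minimal_diet_cost


def is_below_threshold(household_income, minimal_diet_cost):
    """True if income is less than the threshold for the given diet cost."""
    return household_income < poverty_threshold(minimal_diet_cost)


# Invented figures: an $8,000 annual diet cost implies a $24,000 threshold.
print(poverty_threshold(8_000))
print(is_below_threshold(23_000, 8_000))
```

Because the rule is this mechanical, two analysts applying it to the same household will reach the same answer, which is why the measure is reliable even when its validity is debated.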
Second, operational validity depends on the content of the measure and relevance. A measure of poverty may take into account the cost of food, housing, and medical care. Other possible content includes the cost of child care and transportation. The relevance depends on how the data will be used and interpreted. Over time the poverty threshold has become less relevant, because food consumes far less than one-third of most family budgets.
Third, operational validity always involves judgment. Some people may argue that a poverty measure should consider only if households can meet the most basic needs. Others may argue that education, child care, or transportation should be included, because without them, families may stay impoverished. It is not possible to prove which is the more valid. Judgment also comes into play when measures are modified. Users may have to decide whether to sacrifice comparability or reliability in order to have a more relevant measure.
An important measure, one that will impact decisions, inevitably spawns disagreement about its validity. 2 Should a measure of poverty account for entitlements, such as food stamps? Is a measure of good government biased toward the values of businesses? Does gross domestic product (GDP) measure progress if it ignores pollution and social inequities?
Designing a measure takes considerable work, skill, and thought. While you cannot prove that a measure is operationally valid, you can produce evidence that its content is representative and it is relevant. In the following sections, we discuss the evidence based on a measure’s content, its relationship to other measures, and its consequences.
Evidence Based on Content: Establishing evidence that a measure is valid begins with the conceptual definition. The appropriate content may emerge later in discussions with stakeholders about what the concept means to them and how they plan to use the data. The next step is to design an operational definition that represents the conceptual definition. Let’s go back to our example of the quality of volunteer orientation. The conceptual definition was that attendees would understand the organization, value its services, be motivated to volunteer, and feel prepared to work with clients. We chose six items to measure the quality of orientation. For each item the respondent used a scale ranging from “strongly agree” to “strongly disagree.”
Understand the organization:
I understand the mission of [name of organization].
I can describe the services [name of organization] offers.
Value its services:
I feel qualified to refer clients to [name of organization].
I am comfortable with recruiting others to volunteer with [name of organization].
Motivated to volunteer:
I have signed up to be a volunteer with [name of organization].
Prepared to work with clients:
I feel uncomfortable about working with clients.
You have to compile a set of items that you believe adequately captures the conceptual definition. We will leave it to you to decide if the six items listed here are representative. It is a matter of judgment. To ask about every aspect of orientation may require many more questions—far more questions than respondents would want to answer or staff would want to analyze.
As we noted earlier, the operational definition measures attendees’ perception and does not confirm that they understand the organization’s mission and its services, are capable of working with clients, or are willing to actually volunteer. Thus, you cannot assume that attendees’ perceptions are a valid measure of the actual quality of orientation.
A measure’s relevance is directly linked to its intended use. The six items provide a snapshot of attendees’ opinions about orientation. They may identify weak spots in orientation. However, the items may be inadequate in a different context, for example, if the information is meant to guide revisions in the orientation program by keeping what works and dropping what is ineffective. Also, the decision to measure perception rather than objective measures of quality may depend on how the data will be used and interpreted. Creating and administering an objective measure may seem infeasible or unnecessary for its intended use.
Evidence Based on Relationship to Other Variables: Reviewing a measure’s content does not tell the whole story. You will learn a lot if you compare the data from one measure with data from another measure. We will refer to other measures as criteria. You might select a similar validated measure as a criterion. For example, an adult literacy program manager may compare instructors’ assessments of the reading ability of individual students with their performance on standardized reading tests. If the instructors’ assessment agrees with the standardized test scores the program can forego the cost of obtaining and administering standardized tests. The program will have evidence that instructor assessments are a valid measure of client reading ability.
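A common way to summarize how closely a measure tracks a criterion is a correlation coefficient. The sketch below computes Pearson's r by hand; the function name pearson_r and the scores are invented, and the text does not prescribe any particular statistic for this comparison.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equally long lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least two cases")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("scores must vary for a correlation to be defined")
    return covariation / sqrt(ss_x * ss_y)


# Invented data: instructor ratings (1-5) and standardized test scores for six students.
instructor_ratings = [2, 3, 3, 4, 5, 1]
test_scores = [48, 61, 55, 70, 88, 40]
r = pearson_r(instructor_ratings, test_scores)
print(round(r, 2))
```

A value near 1 would support using the instructor assessments in place of the test; a weak value would call for a closer look at what each measure captures.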
You may select independent assessments as criteria that examine self-assessments and perceptions. Self-reports of health may be compared with a physical exam or a person’s medical records. Employee assessments of their job performance may be compared with agency records. Client assessment of their reading ability may be compared with their instructors’ assessments.
You may select criteria that are logically linked to the measure. You would expect people who gave orientation a high rating to volunteer more than those who didn’t. Likewise, you would expect people who are categorized as poor to have a less nutritious diet than other people. You would expect agencies with high client satisfaction to perform better than ones with low client satisfaction.
Finding a weak relationship between a measure and a criterion may be disconcerting, but it may start a valuable exploration of what information a measure is providing and what it is not providing. We may find that people say that they are healthier than what their medical records indicate. Although we cannot argue that perceived health validly measures a person’s health, the information is valuable. A measure of perceived health can help explain what motivates people to seek medical care or why they ignore serious symptoms.
Another type of criterion is a future outcome. Consider selecting scholarship applicants. Assume that a scholarship’s purpose was to train future leaders. The committee chose and measured items that it believed would identify applicants who use the training and become leaders. To estimate if their selection criteria were appropriate the committee might track the career outcomes of applicants. Similarly, organizations would want to choose employees who will perform well in their jobs. From time to time they may review their selection measures to see if they still capture the quality of employee performance. From examining the outcomes of scholarship selection and employee performance we can learn if people who were expected to be successful were successful. If the criterion includes only qualified and selected people, we will not know if applicants who were rejected based on the criterion would have been equally good or better.
The Consequences of Applying a Measure: Traditionally, operational validity asks if you are measuring what you intend to measure. A closely related question is what are the consequences of applying a measure. The more importance stakeholders place on the data a measure produces the more they should consider its consequences. Administrators and policy makers assume positive benefits when they first adopt a measure. They should confirm that the positive benefits were realized and identify any unintended consequences. The No Child Left Behind Act is a vivid example of the consequence of measures. The act was intended to ensure that all American school children are proficient in mathematics and reading. It called for annual testing of children in third through eighth grades. If a school did not meet statewide performance standards, it was expected to take remedial action. The policy makers assumed that the tests would motivate schools to hire qualified teachers, adopt effective teaching practices, and assist children in danger of failing. Critics have argued that the act’s emphasis on test scores has led states to create easier tests and increase the pass rate, shifted the curriculum away from developing critical thinking, discouraged teachers from teaching in low-achieving schools, and encouraged low-achieving students to drop out. 3 This example reminds us that what is being measured can affect behavior. Sometimes the impact is not what we expect or want.
The sensitivity of a measure refers to its precision or calibration. A sensitive measure has sufficient values to detect relevant variations among respondents; the degree of variation captured by a measure should be appropriate to the purpose of the study. Measures that are reliable may not necessarily detect important differences. Consider a salary survey. Suppose employees were asked the following question:
What is your annual salary for this year? (check appropriate category)
_____ Less than $25,000
_____ $25,000 to $39,999
_____ $40,000 to $54,999
_____ $55,000 to $69,999
_____ $70,000 to $84,999
_____ $85,000 or more
These categories may be adequate to summarize the earnings of all employees, but not the salaries of senior managers. If most top managers earn at least $85,000, all the responses will fall in one category. You may not be able to learn if salary is related to performance.
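The loss of sensitivity is easy to see by sorting a few salaries into the survey's categories. In the sketch below the category boundaries come from the question above, while the function name categorize and the manager salaries are invented for illustration.

```python
from bisect import bisect_right
from collections import Counter

# Lower bounds of every category after the first, taken from the survey question.
CUTOFFS = [25_000, 40_000, 55_000, 70_000, 85_000]
LABELS = [
    "Less than $25,000",
    "$25,000 to $39,999",
    "$40,000 to $54,999",
    "$55,000 to $69,999",
    "$70,000 to $84,999",
    "$85,000 or more",
]


def categorize(salary):
    """Return the survey category that a salary falls into."""
    return LABELS[bisect_right(CUTOFFS, salary)]


# Invented salaries for four senior managers.
manager_salaries = [88_000, 95_000, 120_000, 150_000]
counts = Counter(categorize(s) for s in manager_salaries)
print(counts)
```

A $62,000 gap between the lowest-paid and highest-paid manager disappears once every response lands in the same category, so the measure cannot show whether salary varies with performance.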
The sensitivity of a measure may depend on the homogeneity of respondents. For example, a job-satisfaction measure developed for organizations employing unskilled and skilled laborers; clerical workers; and technical, administrative, and professional staff may be a poor choice for an organization largely made up of professionals. If individual differences are of interest, the measure would not be sufficiently sensitive to compare differences among employees in the more homogeneous group.
No measure can describe fully a concept of interest. At best a measure allows us to estimate what program clients achieve, volunteer and employee productivity, or the number of poor. The information a measure provides has great value, but you should also recognize each measure’s limitations.
Measurement begins with a conceptual definition. The conceptual definition indicates what stakeholders mean by a concept and how they plan to use the data. The conceptual definition serves as a blueprint for the operational definition, which details exactly how a concept or variable is measured and its values are determined. The conceptual definition is particularly important in deciding whether the measure’s content is representative and valid.
You should establish the reliability, operational validity, and sensitivity of all measures before data collection begins. Reliable measures allow you to conclude that differences between subjects or over time are real differences, and not due to the measure or the measuring process. A qualitative review of a proposed operational definition will markedly improve reliability and may be adequate. You should verify that directions are clear and easy to follow, that items are clearly defined, that given responses cover all likely responses, and that respondents can provide the requested information. People responsible for collecting or processing data should be trained to avoid inconsistencies. If knowing and limiting the amount of random error are important, you should use mathematical procedures to estimate the amount of error. Unreliable data should be discarded.
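One common mathematical procedure for estimating reliability is to administer the same measure twice to the same people and correlate the two sets of scores (test–retest). The sketch below is a minimal illustration, not a prescribed method: the scores are invented, and the function name `pearson_r` is ours. A coefficient near 1 suggests little random error; how high is "high enough" is a judgment that depends on how the data will be used.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of the same length with at least 2 scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list has no variation")
    return cov / sqrt(var_x * var_y)


# Hypothetical scores for six respondents, invented for illustration only.
first_administration = [12, 15, 9, 20, 17, 11]
second_administration = [13, 14, 10, 19, 18, 10]

r = pearson_r(first_administration, second_administration)
print(f"test-retest correlation: {r:.2f}")
```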
A reliable measure is not necessarily operationally valid. A valid measure measures the concept of interest. Evidence based on a measure’s content documents that the operational definition is relevant and representative. Criterion-based evidence empirically establishes the relationship between a measure and a criterion. The criterion may be a similar measure, an alternate measure, or a future outcome. Evidence may also be gathered to document the consequences of implementing a measure. The evidence validating a measure informs and supports the judgment of data users; it does not replace their judgment. A sensitive measure sufficiently distinguishes cases from each other so that they can be compared.
More detailed discussions on measurement, especially quantitative evidence of reliability, can be found in texts on educational measurement or psychometrics. A recommended text is Susana Urbina, Essentials of Psychological Testing (New York: John Wiley & Sons, 2004), which is a basic, accessible resource on measuring, including knowledge, abilities, attitudes, and opinions.
A standard reference on measurement is Standards for Educational and Psychological Testing (Washington, D.C.: American Educational Research Association, 1999). The standards were created by a joint committee of the American Educational Research Association (AERA), American Psychological Association, and National Council on Measurement in Education and approved by their respective governing bodies. The AERA endorsement stated, “We believe … the Standards to represent the current consensus among recognized professional[s] regarding expected measurement practice. Developers, sponsors, publishers, and users of tests should observe these standards” (Standards, viii).
CHAPTER 2 EXERCISES
There are four separate exercises for this chapter. Each exercise develops your competence in interpreting and applying measurement concepts.
• Exercise 2.1 Good Nutrition Survey focuses on the strategies for designing reliable and operationally valid measures.
• Exercise 2.2 Living Wage: From Idea to Measure focuses on the concept of living wage. The exercise suggests the relationship between measurement and politics, and how a concept is measured and how the data will be used. The exercise implies the complexity of developing a sound operational definition.
• Exercise 2.3 Selecting Job Applicants focuses on employee hiring and implementing strategies to develop and assess the reliability and operational validity of the selection process.
• Exercise 2.4 On Your Own asks you to design and assess a relevant measure for your agency (where you work or have an internship).
EXERCISE 2.1 Good Nutrition Survey
Scenario
A nutritional council has as its mission to improve public knowledge of nutrition and advocate for healthier diets. You have been asked to help the council design a self-administered survey to measure an individual’s knowledge of sound nutrition and eating habits.
Section A: Getting Started
1. Why should the council develop a conceptual definition?
2. Find three conceptual definitions of good nutrition.
3. Which of the three definitions seems most appropriate for the nutritional council to use? Justify your choice.
4. Based on the conceptual definition write an operational definition.
5. What steps would you take to develop evidence that your operational definition is reliable?
6. Should you focus on the qualitative evidence of reliability prior to implementing any procedure to establish quantitative evidence? Justify your answer.
7. What evidence would you cite to argue that your operational definition is operationally valid?
8. Consider the appropriateness of using your operational definition to measure the nutrition levels of
a. a predominately Hispanic community.
b. pregnant women.
c. low-income children.
9. In the course of writing surveys, you may become concerned with the reliability or operational validity of specific questions. Comment on each of these proposed questions.
a. Does your family eat a healthy diet?
b. How many servings of protein do you eat daily?
c. How many times did you eat breakfast in the past 7 days?
d. How many calories do you consume in a typical day?
10. Long surveys often have a low response rate. Would you argue that asking only one or two questions is sufficient to measure an individual’s knowledge of good nutrition? Justify your answer.
Section B: Class Discussion
1. Based on the conceptual definitions that class members found, consider which definition seems most appropriate to
a. design an educational campaign, Eat Smart.
b. design sample menus and recipes for council publications and fliers.
c. create a dietary checklist for medical personnel to include in a physical exam.
2. Choose one of the operational definitions identified and use the qualitative indicators of reliability to assess how well it would work for a
a. telephone survey of the general population.
b. self-administered survey of parents of children who receive subsidized meals.
c. self-administered survey of elderly persons who eat one meal a day in senior centers.
3. What evidence suggests that the operational definition is operationally valid?
4. Questions may focus on respondents’ perception of the quality of diet or they may ask for more specific information, such as listing foods actually eaten in the past 48 hours. Assess the benefits and drawbacks of asking
a. questions about what people believe regarding different types of food.
b. questions about what foods people eat.
EXERCISE 2.2 Living Wage: From Idea to Measure
You have been asked to serve on a community task force to consider whether it should adopt a livable wage ordinance.
Section A: Getting Started
1. What does the term livable wage mean to you?
2. Search the Web to identify a discussion of the livable wage: what it means, its desirability, the difficulty in determining what constitutes a livable wage. What are the major measurement issues that the discussion focused on?
3. Do a Web search and find two more conceptual definitions of livable wage.
4. Deciding on an appropriate conceptual definition is a matter of judgment colored by how the data will be used. For each potential user on the following list, identify which definition you think would be preferred and why.
a. By employers deciding on an acceptable salary for their lowest-paid employees
b. By an advocacy group lobbying for an increase in the minimum wage
c. By a public agency determining who is eligible for services, such as financial assistance, food subsidies, and medical services.
5. Respond to the following items to develop an operational definition for “livable wage.”
a. Identify the components you would include in an operational definition of a livable wage.
b. In your operational definition, will some items receive more weight than others or will all items be weighted equally? Explain.
c. Do you anticipate that the operational definition will work equally well in urban and rural areas? Explain.
6. Consider measuring the cost of rental housing.
a. Write an operational definition for cost of rental housing.
b. Would you expect to get more reliable data from landlords or renters? Explain.
Section B: Class Exercise
In assigned groups hold a task force meeting. Discuss the following:
1. Share the conceptual definitions and identify which definition(s) the task force should adopt or present for public debate.
a. What components of the conceptual definition do group members agree on and disagree on?
b. Are the reasons group members disagree based on ideology, purpose of the measure, impact of the data, feasibility of developing or implementing an operational definition, or something else?
2. Identify how you would decide what wage would constitute a livable wage in your community.
3. Identify what operational definition you would recommend to identify the number of wage earners in your community who do not earn a livable wage.
4. Based on what you have learned and observed in answering Sections A and B create a list “Lessons Learned about Measuring Concepts.”
EXERCISE 2.3 Selecting Job Applicants
Reliability and operational validity apply to many tasks, especially if we want to rank or categorize something. In this exercise we ask you to use your knowledge of measurement and develop a procedure for reviewing job applications for the position of Director, Whitney M. Young Center. After reading the job announcement, respond to the items that follow it.
Job Announcement: Director
The Whitney M. Young Center for Urban Leadership (WMYCUL), an affiliate of the National Urban League, is a nonprofit educational institute that exists to foster positive social and economic change through effective leadership development. Our mission is to cultivate and enhance the leadership capabilities of individuals and organizations that serve urban communities. Our role is to convene and link those entities to practical leadership development tools and resources to help them address capacity issues and various leadership challenges. Our philosophy is that through the enrichment and/or development of urban leaders, urban communities can be improved and empowered, thereby helping to create a society whereby all people have equal opportunity to be positive contributors.
Position Description
The director coordinates and manages the professional development/leadership training needs of 100-plus affiliates located throughout the United States and the National Office.
Essential Functions
■ Oversees professional development/leadership training for the affiliates and National Office. Develops videos, resources guides, handbooks, manuals, and guidelines for the affiliates. Identifies 21st century resources to enhance system-wide capacity.
■ Coordinates and monitors training activities, including annual leadership conferences.
■ Works with staff to develop strategies to support strategic direction of center
■ Performs other duties as assigned
Requirements
■ Master’s degree in social work, public administration, or related area
■ Certified Professional in Learning & Performance (CPLP) a plus
■ Seven to ten years experience designing training curriculum, program design and execution, or leadership development; experience with national professional training network a plus
■ Able to plan, organize, control, delegate, and manage multiple projects simultaneously
■ Familiar with nonprofit governance and operations
■ Strong analytical, presentation, and facilitation skills. Must have meeting-planning experience. Excellent interpersonal and communication skills.
■ Proficiency in Microsoft Office.
To apply, submit resume, cover letter, and writing sample to Human Resources Department, or e-mail [email protected]. Deadline September 7.
The National Urban League is the nation’s oldest and largest community-based movement empowering African Americans and other people of color to enter the economic and social mainstream. The National Urban League is an Equal Opportunity Employer.
Section A: Getting Started
1. Why is it important to have a reliable and valid process for selecting which applicants to interview?
2. You plan to create a rating scale, that is, you identify and assign a numeric value to each key requirement. How would you use qualitative methods to make sure that the scale is reliable?
3. How would you use test–retest to establish the scale’s reliability?
4. Assess the value of using test–retest to establish the scale’s reliability.
5. How would you use inter-rater reliability to establish the scale’s reliability?
6. Assess the value of using inter-rater reliability to establish the scale’s reliability.
7. Create a plan to establish the scale’s reliability, that is, consider qualitative methods, test–retest, and inter-rater reliability. Which strategy(ies) would you use, why, and in what order would you use them?
8. What would you look for to establish the scale’s operational validity?
9. If after you complete the first round of interviews you find that none of the interviewed applicants were suitable, would you conclude that the problem was with the reliability or the operational validity of the scale? Justify your answer.
10. Review the job description and identify the requirements that should be included in the scale. Also note if any of the requirements should be given more weight.
11. Examine the following two resumes and create a scale that could be used to rate each resume.
J. Q. Public
123 Main Street
Anytown, Yourstate, USA
(111) 555-1234
Objective
To apply my education and strong work ethic toward a career in the public relations, specifically event planning and program coordination.
Keywords
Event planner, event coordinator, meeting planner, wedding planner, project manager, communications, advertising, public relations, entertainment.
Experience
Promotions Coordinator, Local Radio Station (2007–present) Planned, coordinated and executed over 200 on-site promotions and remotes. Assisted the Marketing Department in developing and executing station contests, promotions and events. Worked with programming department to carry out station programming agenda. Directed promotional street teams to implement on-site promotional campaigns. Created new contests and promotions that drive station-marketing objectives. Engineered remote broadcasts. Assisted in building and maintaining client relationships. Helped to execute many large-scale events including the Annual Ball, Community Jam, and Cultural Fest.
Executive Director, Any Magnet school, Inc. (2004–2007). Planned major special events and lead extracurricular activities. Worked with administration, faculty and student teams to plan homecoming events and fundraising dinners; tracked all milestones and ensure on-time, successful completion. Planned fundraising events such as International Day, including advertising/promotion, decorations, costumes, student presentations and sponsorship. Directed annual African American Heritage Festival, student competition with judges and awards ceremony attended by more than 500 people; program included dinner buffet and presentation of group projects on educational topics.
Project Manager. Human Services Agency, Inc. (1999–2004) Responsible for implementing the Independence Program, a welfare to work program that assisted heads of households transition from welfare to self-sufficiency through employment. Developed relationships with area employers and promote program services. Serve as liaison to collaborating agencies including State Department of Human Resources, City Department of Social Services, and other community based organizations. Hired and supervised a professional and paraprofessional staff of 15. Developed and revised forms, manuals, brochures, and MIS systems. Compiled and provided monthly and quarterly reports to funders and stakeholders.
Program Director. Children’s Human Services, Inc. (1994–1999) Responsible for the successful implementation of a childcare center for homeless children with special needs. Managed program budget of $275,000. Supervised a staff professionals and paraprofessionals and college level human service program interns. Coordinated program activities with collaborating agencies for service provision, grant writing, planning, resource sharing, and development. Compiled and provided reports for funders and stakeholders.
Education
MSW and MA in Organizational Leadership, Midstate University (1995). BA, Communications, Atlantic State University (1990).
Additional Information
Computer skills include Windows, Word, Excel, PowerPoint and Internet research.
Jane Doe
9876 Elm Street. Springfield, (111) 555-1234
I am interested in a Event Planning position for a corporation or nonprofit organization. In addition to the Bachelors of Science Degree in Hospitality Tourism Management/Business Administration, with an emphasis in Events, Attractions and Conventions Management, I also possess the following qualifications and areas of expertise:
Education
Columbia Southern University: MBA with Concentration in Hospitality and Tourism
University of Delaware: Bachelor of Science Hotel, Restaurant, and Institutional Management
Work Experience
Independent Contractor – Open Events, LLC. 2004–present
■ Assist with the planning and management of annual sales meetings for international high-tech companies,
■ Work with and coordinate food and beverage staff, audio-visual personnel, travel department, ground transportation, security and labor.
■ Oversee the planning and execution of seminars in the US, Caribbean, and Europe
■ Perform site research and negotiation, menu selection, registration and travel coordination, shipping and receiving materials between destinations and the on-site management of the entire event program.
Corporate Affairs Director – Telephone company Public Relations 2000–2004
■ Established and leveraged strategic relationships with key stakeholders in the Hispanic, Asian, and businesswomen’s markets. Successfully positioned these associations to win support for Telephone Co. sales initiatives and public policy.
■ Partnered with Supplier Diversity division to ensure engagement with minority suppliers of all groups, i.e., African American, Asian Pacific American, Hispanic, and Women.
■ Directed Telephone co. multimillion dollar philanthropic investments with these and other national organizations via the Telephone Co. Foundation and business unit funding. Garnered significant coverage in mainstream as well as Spanish-language and Asian-language media for Telephone Co. grants and sponsorships.
■ Secured more than 3,100 constituent letters sent to Congress and several op-eds in support of Telephone Co. public policy positions.
Promotions Coordinator-Local Radio Station 1994-2000
■ Planned, coordinated and executed over 200 on-site promotions and remotes.
■ Assisted the Marketing Department in developing and executing station contests, promotions and events.
■ Worked alongside programming department to carry out station programming agenda.
■ Worked in conjunction with promotional street teams to implement on-site promotional campaigns.
■ Helped to create new contests and promotions that drive station-marketing objectives.
■ Assisted traffic department with entering sales orders, filing, and writing contracts.
■ Engineer remote broadcasts.
■ Assisted in building and maintaining client relationships.
Other Skills & Experience
Self-starter, goal orientated, assertive, and possess a warm outgoing personality. • Excellent time management skills, detail orientated, strategic thinker and a track record of making sound decisions. • Style which exhibits maturity, high energy, sensitivity, teamwork, and the ability to relate to a wide variety of professionals. • Strong interpersonal communication skills with ability to effectively identify and fulfill customer needs • Fluent in Portuguese and Spanish • Familiar with Mac, PC and Internet platforms, Microsoft Word, Excel and Quicken
12. Now rate both applicants using your scale.
Section B: Class Discussion
1. In small groups review each member’s rating scale.
a. Use the qualitative method to evaluate the reliability of each person’s scale. Indicate what changes need to be made to improve any problematic item’s reliability.
b. Assess each scale’s operational validity. Taken as a whole does each scale’s choice of items seem relevant? Do the items seem representative? What evidence supports your conclusion?
c. What empirical evidence would you seek to demonstrate that the scale was operationally valid?
2. Choose one of the scales and have each member of the group apply it to the two resumes. To assess the inter-rater reliability of the selected scale, compare your ratings.
a. How similar are they?
b. On which items are your ratings the most different? How can you improve these items?
3. Based on this exercise draft a guide “Suggestions for developing and assessing the reliability and operational validity of measures to select applicants for jobs and promotions.” Compare your guidelines with those developed by your classmates.
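When group members compare their ratings in the discussion above, a simple tabulation can show how often two raters agree and which items produce the widest disagreement. The sketch below is one possible approach under stated assumptions: the item names, the 0-to-3 scoring, and the ratings themselves are all hypothetical, and exact agreement is only one of several ways to summarize inter-rater reliability.

```python
def exact_agreement(rater_a, rater_b):
    """Share of items on which two raters gave identical scores."""
    items = list(rater_a)
    matches = sum(1 for item in items if rater_a[item] == rater_b[item])
    return matches / len(items)


def item_spread(ratings_by_rater):
    """For each item, the gap between the highest and lowest rating given."""
    raters = list(ratings_by_rater.values())
    items = list(raters[0])
    return {
        item: max(r[item] for r in raters) - min(r[item] for r in raters)
        for item in items
    }


# Hypothetical ratings of one resume on a 0-3 scale, invented for illustration.
ratings = {
    "rater_a": {"education": 3, "training_experience": 1,
                "nonprofit_familiarity": 2, "communication": 2},
    "rater_b": {"education": 3, "training_experience": 2,
                "nonprofit_familiarity": 2, "communication": 1},
    "rater_c": {"education": 3, "training_experience": 0,
                "nonprofit_familiarity": 2, "communication": 2},
}

spread = item_spread(ratings)
most_divergent = max(spread, key=spread.get)

print(exact_agreement(ratings["rater_a"], ratings["rater_b"]))
print(spread)
print(most_divergent)
```

Items with the largest spread are the first candidates for clearer definitions or more specific scoring instructions.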
EXERCISE 2.4 On Your Own
This exercise is applicable if you are working or have an internship. It guides you through the steps required to design a reliable and operationally valid measure and to observe the value of stakeholder feedback.
Consider the concepts you are trying to measure. List and prioritize them.
1. From among these concepts select a key concept. Find out how similar organizations define it. Be sure to look at the relevant literature for definitions.
2. Consider how these definitions are similar or different. Is the concept or definition contested? Has it evolved over time? If so, in what ways and what influenced the change?
3. Adapt an existing operational definition for the concept or develop your own.
4. Assess its reliability and operational validity.
5. Review your conceptual and operational definition with relevant colleagues or other relevant stakeholders.
a. Do they agree that you are measuring what you intend to measure?
b. Do they agree that your operational definition is both relevant and representative?
c. Do they raise concerns about the wording of items and the responses, the willingness and ability of respondents to provide the information?
NOTES
1 This definition is used by the National Assessment of Adult Literacy, an assessment conducted by the National Center for Education Statistics. For more information go to http://nces.ed.gov/naal/.
2 You can find information on the Internet about many types of measures and indicators. For example, if you want to get an evaluation of indicators of volunteerism, you could enter “volunteerism indicator critique” into a search engine. You will find sources that present a critique that will add to your understanding of measurement and the construction of measures. For this paragraph, we relied on the following articles and Web sites (all accessed on June 7, 2010): Tina Soreide, “Is it right to rank? Limitations, implications and potential improvements of corruption indices,” Chr. Michelsen Institute, Norway, http://www.cmi.no/publications/publication/?1973=is-it-right-to-rank-limitations; Daniel Kaufmann, Aart Kraay, and Massimo Mastruzzi, “The Worldwide Governance Indicators Project: Answering the Critics,” The World Bank, http://www-wds.worldbank.org/external/default/WDSContentServer/IW3P/IB/2007/02/23/000016406_20070223093027/Rendered/INDEX/wps4149.txt; “Measuring Progress,” Friends of the Earth, http://www.foe.co.uk/campaigns/sustainable_development/progress/replace.html
3 L. Darling-Hammond, “Evaluating ‘No Child Left Behind,’” The Nation, May 2, 2007. Posted at http://www.thenation.com/doc/20070521/darling-hammond (accessed December 7, 2009); J. E. Ryan, “The Perverse Incentives of the No Child Left Behind Act,” New York University Law Review, June, 2004.
Ethical Treatment of Research Subjects
At some point during the research process investigators may require research subjects, that is, people who will answer surveys, agree to interviews, participate in focus groups, or enroll in a demonstration project. People who agree to answer questions or participate in a study expect to be treated respectfully and ethically; they do not expect to be harmed by merely participating in a study.
Subjects who provide information for community assessments or answer stakeholder surveys are unlikely to experience physical harm or acute psychological distress. Rather the “harm” they experience may be subtle, such as losing some privacy, wasting their time, or experiencing unpleasant emotions such as anger, defensiveness, or distrust. Staff that design, implement, and evaluate a program must be attuned to a study’s potential to cause significant harm to participants.
This chapter covers U.S. regulations and professional standards that apply to community and organizational research involving human subjects. To put current standards in context we will summarize the Tuskegee Syphilis Study, a landmark case in unethical research. The Tuskegee study is important for two reasons. First, its research procedures were identified as unethical practices that regulations needed to cover. Second, it demonstrated the serious social consequences of failing to protect research subjects.
A CORNERSTONE OF ETHICAL RESEARCH: THE TUSKEGEE SYPHILIS STUDY
The Tuskegee Syphilis Study is the best known U.S. example of an egregious abuse of human subjects. The study’s subjects, all of whom were poor African American men, were explicitly denied effective treatment for syphilis, that is, they were harmed simply by participating in the research.
In 1932, U.S. Public Health Service researchers started monitoring the health of two groups of African American males. One group had untreated syphilis; the other group was free of syphilis symptoms. The objective of the research was to document the course of untreated syphilis. At the time the study began, treatments for syphilis were potentially dangerous, so the researchers may not have questioned the ethics of not treating subjects. By the mid-1950s penicillin was widely available and known to be an effective treatment for syphilis. Yet the study continued until 1972; the untreated subjects still had not received penicillin, and they had been actively discouraged from seeking treatment elsewhere. The failure to treat the subjects was particularly disturbing because the study continued unchallenged despite the Nuremberg Code.1
At the end of World War II, disclosure of Nazi atrocities included reports of doctors and scientists who performed human experiments on Jewish prisoners. Military tribunals were held to try the Nazi leadership. One of the Nuremberg Military Tribunal verdicts listed 10 principles of moral, ethical, and legal medical experimentation on humans. The principles, referred to as the Nuremberg Code, have formed the basis for regulations protecting human research subjects. The Tuskegee study violated the code’s principles, including not asking the subjects to give their free and informed consent to participate, ignoring the researchers’ obligation to avoid causing unnecessary physical suffering, not allowing the subjects to terminate their participation at any time, and disregarding the researchers’ obligation to discontinue an experiment when its continuation could result in death.
Traditionally experiments include a control group, that is, a group that does not receive the experimental treatment. Ethical research practice prohibits withholding beneficial treatment from subjects. If the most beneficial treatment remains unknown, the control-group subjects must be assigned to an alternate beneficial form of treatment. Subjects cannot simply be denied treatment. For example, if you were to conduct a study of depression, all subjects would need to be assigned to some form of treatment (e.g., psychotherapy, narrative therapy, medication). Because we know so much about effective depression treatment, each form of treatment would be better than no treatment at all. If the subjects receiving an experimental treatment show marked improvements, the study must be discontinued and the treatment made available to the control group. The Tuskegee study had a lasting effect in ensuring that research subjects are not denied beneficial treatment.
The shaded area below summarizes four other well-known studies that raised questions about whether the subjects had been treated ethically. Although no documentation exists that indicates these cases caused harm to individuals, they raised issues that have informed ethical practice.
Four Classic Ethics Cases Involving Human Subjects
Jewish Chronic Disease Hospital (1963). 2
What Happened: Elderly patients were asked to give consent to be injected as part of research on immune system responses. They were not told that the injections were unrelated to their disease or its treatment or that the injections contained live cancer cells.
Ethical Concern: Investigators found that a vague request for consent did not constitute informed consent.
University of Chicago Law School Taping of Jury Deliberation (1954) 3
What Happened: Discussions among juries hearing civil cases were recorded without their knowledge, but with the permission of the judge and the litigants’ attorneys. Recordings were to be kept with the judge until the case was closed.
Ethical Concern: Potential loss of confidence in public institutions by compromising the secrecy of jury deliberations.
The Tearoom Trade (1970) 4
What Happened: A doctoral student served as a lookout and observer for men having sexual encounters in public restrooms. Later he used license plate numbers to track down the men and interview them under the pretense of conducting a public health survey.
Ethical Concern: An example of deceptive research, where a subject is not told the purpose of the research and potentially the researcher discloses something the subject considers private.
The Milgram Experiments (1961) 5
What Happened: Subjects were recruited to give other subjects (actually actors) what they thought were electric shocks.
Ethical Concern: Another example of deceptive research. In this case subjects might gain unwanted self-knowledge. ■
PRINCIPLES OF ETHICAL TREATMENT OF HUMAN SUBJECTS
In response to the Tuskegee study and other reported abuses, the U.S. government formed the National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research and charged it with identifying the basic ethical principles for research involving human subjects. The commission summarized its findings in the Belmont Report.6 The report identified three basic ethical principles: respect for persons, beneficence, and justice. Respect for persons requires that subjects participate voluntarily and with adequate information about the proposed study to make informed decisions regarding participation. Beneficence requires doing no harm, maximizing possible benefits, and minimizing possible harms. Justice requires that research subjects not be selected “simply because of their easy availability, their compromised position, or their manipulability, rather than for reasons directly related to the problem at hand.”
These principles underlie U.S. regulations, which apply to investigators conducting research for a federal agency or on a federally funded project. The principles should guide the behavior of all members of a study team whether or not the study is covered by federal regulations. Implementation of these principles requires that subjects give informed consent, that benefits and risks be identified and weighed, and that selection of subjects be fair. Informed consent demonstrates respect for persons; assessing risks and benefits demonstrates beneficence, and fairly selecting subjects demonstrates justice.
A potential research subject must have adequate information to make an informed, voluntary decision to participate. You must tell potential subjects in words that they can clearly understand about the study’s purpose, its potential risks, and possible benefits. They need to know what they will be expected to do, what will be done to them, and what will be done with the collected information. You should provide them with information that answers the following questions. Who will receive the information? What type of information will be disseminated? What steps will be taken to protect the identity of the subject? What will happen if you learn of an illegal act or identify a health risk? If photos, movies, recordings, or similar research records are being produced, what will be done with them? We know of one student who was shocked to learn that an interview tape, produced as part of an experiment, was being shown to classes at her college.
Specific circumstances may impede a subject’s ability to make a voluntary decision. Prisoners, members of the military, and schoolchildren, all of whom are in controlled settings, may interpret requests for information or participation as commands. Research involving prisoners is particularly difficult. Even modest incentives can compromise a prisoner’s ability to assess the risks of participation and may preclude a voluntary decision to participate. Similarly, patients may mistakenly believe that their research participation will have a therapeutic benefit. They should receive clear, realistic information on the benefits and risks of participation. Still, the ability of seriously ill persons to give informed consent, no matter what they are told, is questionable. A New York Times article summed up the problem as follows: “Potential participants are often desperately ill and may grasp at any straw—even signing a document without reading it. For this reason, many say there is no such thing as informed consent—only consent.”7 Vulnerable populations, such as children, aged people, and mentally disabled people, may not be fully capable of making an informed decision or protecting their own interests. In general, researchers try to get informed consent from both the potential subjects and from their legal guardians.
An authority relationship between the investigator and potential subjects may raise questions about voluntary consent. Physicians, social workers, or professors may ask their patients, clients, or students to participate in a research study. A patient may agree because she doesn’t want to sour the relationship with her physician. A client may agree because he may suspect that he will be overlooked for other opportunities to alleviate his problems. A student may agree because she suspects that it won’t hurt and may help her course grade.
Finally, voluntary participation requires the ability to withdraw from a study at any time. Potential subjects must be told this as part of the process of obtaining informed consent. Furthermore, they must also be told that other benefits they are entitled to will not be affected by their decision to participate or to discontinue their participation. For example, a client receiving social services must be told that his continued eligibility for these services does not depend on participating in a research study.
Informed consent and a signed informed consent form are not one and the same. A subject must be free to make her own decision about whether she wishes to participate or not. To make this decision she must be informed about the study and give her consent to participate. A signed consent form merely documents that she received the information and consented to participate. For projects where a subject experiences no risks beyond the risks of everyday life or ordinary professional responsibilities, signed statements may be reasonably viewed as unnecessary. Online surveys typically open with a statement describing the research; subjects indicate their willingness to participate by clicking on an “accept” button. Below we give you an example of obtaining consent for an opinion survey. The fact that a survey recipient does not sign an informed consent form does not relieve you of the responsibility to give the subject sufficient information so that her consent to participate is truly informed.
Identifying and Weighing Costs, Risks, and Benefits
People may not agree on the risks and benefits of research participation. Individuals’ educational, social, and professional backgrounds contribute to how they define and rank risks and benefits. You may incorrectly assume that a proposed study presents minimal or no risk. You should be especially vigilant if potential subjects differ from you. For example, you may underestimate the potential harm in studying recent immigrants. You may misjudge what constitutes a risk or a benefit for an immigrant and erroneously assume that your requests for consent are unbiased and informative.
Request for Consent to Participate in an Online Survey
This is a study about opinions regarding various issues in the news. This is an anonymous survey and your identity is not connected to your responses in any way. Clicking “Yes” below indicates that you agree to participate in this study.
□ Yes, I agree to participate.
□ No, I do not wish to participate. ■
What are the benefits of participating in research? The research may involve a treatment or program expected to relieve a psychological problem or increase economic opportunities. The research may seem to benefit a group that a potential subject values. Survivors of floods and wildfires may agree to participate in studies because they believe that the findings will contribute to better future response and recovery efforts. For some studies—perhaps most—the subject may participate because the research question seems somewhat interesting and the inconvenience is minimal.
Remuneration may be a valuable benefit. We know of a few graduate students who subsidized their incomes by participating as subjects for biomedical research studies. Paying subjects for the inconvenience of participating is not unethical, unless the amount is so large that it may be viewed as a bribe or a questionable inducement to participate. What distinguishes reasonable reimbursement from “questionable inducements”? Federal regulations offer no guidance, and opinions vary as to whether participants should receive more than compensation for their direct costs.
What are the risks of participating in research? The most commonly cited risks are physical harm, pain or discomfort, embarrassment, loss of privacy, loss of time, and inconvenience. A study could potentially uncover illegal behavior. As part of informed consent potential subjects must be told what risks they may experience during the study or as a result of the study. They must be told if the risks are unknown, or if researchers disagree on the risks.
Subjects can only be told about benefits that can be reasonably expected. Theoretically, a study may be groundbreaking; however, most studies are not. Consequently, a subject should not be told that a study has a probability of generating significant knowledge. Nor should potential subjects be led to believe that they will gain benefits that are possible but unlikely.
Although not usually covered as part of informed consent you may wish to consider that risks and benefits may apply beyond individuals. For example, families may be affected by the time a family member spends participating in a study, by his reactions after sharing personal information, or by the outcome of participating in a program to encourage lifestyle changes.
Recruiting subjects is the first step in assuring informed consent. Fliers, Internet sites, or other recruiting materials and media should state that participants are sought for a research project. The words used to solicit subjects may act as a questionable inducement. Consider the attractiveness of being asked to test out a “new” or “exciting” treatment, to participate in a “free” program, or to receive “$1000 for a weekend stay in our research facility.” If you recruit participants through personal contact, you must be particularly sensitive to not pressuring them to participate.
Simply contacting potential subjects can raise privacy concerns. Privacy refers to the ability to control disclosure and dissemination of information about oneself. One dimension of privacy is physical privacy, that is, not having to endure unwanted intrusions. If a person is likely to wonder “how did they get my name?” you may have violated his privacy. For example, if you plan to collect information from food pantry users, a staff member who knows the users should ask their permission to be contacted. They should be assured that whether or not they choose to participate will not affect services they normally receive.
Sample size is not included as a component of informed consent, nor is it discussed in behavioral and social science texts on research ethics. Still, the desired number of subjects may affect how vigorously subjects are recruited and open the door for subtle coercion. The American Statistical Association includes sample size in its ethical guidelines. Its guidelines state that statisticians making informed recommendations should avoid allowing an excessive number of subjects or an inadequate sample.8 An inadequate sample can affect the quality of the statistical analysis, for example, decisions about whether findings occurred by chance. Conversely, an overly large sample may squander resources, including participants’ time. An appropriate sample size partially depends on how the data will be analyzed. For example, a sample of 400 might be adequate to describe volunteer activities of state residents, but it may be too small if you want to compare volunteers in the state’s counties.
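The point about 400 respondents can be illustrated with a rough calculation. The sketch below is only an approximation under stated assumptions: a simple random sample, a 95 percent confidence level, the worst-case proportion of 0.5, and a hypothetical state with 10 counties that each contribute an equal share of respondents. None of these figures comes from the volunteer example itself.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate margin of error for a proportion from a simple random sample.

    z=1.96 corresponds to a 95% confidence level; p=0.5 is the worst case.
    """
    return z * math.sqrt(p * (1 - p) / n)

statewide_n = 400
counties = 10  # hypothetical number of counties, for illustration only
per_county_n = statewide_n // counties  # assumes equal allocation across counties

statewide_moe = margin_of_error(statewide_n)
county_moe = margin_of_error(per_county_n)

print(f"Statewide (n={statewide_n}): +/- {statewide_moe:.1%}")
print(f"Per county (n={per_county_n}): +/- {county_moe:.1%}")
```

Under these assumptions the statewide estimate is precise to within about 5 percentage points, while each county estimate could be off by roughly 15 points, which is too imprecise for most county-to-county comparisons.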
PROTECTING PRIVACY AND CONFIDENTIALITY
You may encounter the terms privacy, confidentiality, anonymity, and research records in discussions on collecting individual data. As we noted, privacy refers to an individual’s ability to control the access of other people to information about himself. Confidentiality refers to protection of information, so that you cannot or will not disclose records with individual identifiers. Anonymity refers to collecting information so that you cannot link any piece of data to a specific, named individual. Research records refer to records gathered and maintained for the purpose of describing or making generalizations about groups of persons. Research records are different from administrative or clinical records, which are meant to make judgments about an individual or to support decisions that directly and personally affect an individual.
The requirements of voluntary participation and informed consent uphold the individual’s right to have control over personal information. While you may promise anonymity or confidentiality, a potential subject may not necessarily trust you to follow through. Guarantees of confidentiality neither ensure candor nor increase propensity to participate in research; rather, limited research has found that respondents tend to view promises of confidentiality skeptically.9 Nevertheless, you must respect participants’ privacy and maintain confidences as part of your professional responsibilities to subjects.
Some questions may seem unduly intrusive. They may stir up unpleasant recollections or painful feelings for subjects. Such topics include research on sexual behaviors, victimization, or discrimination. People who read pornography, have poor reasoning skills, or harbor controversial opinions may prefer to keep this information to themselves. Disclosure of behaviors such as drug use, child abuse, or criminal activity may cause a respondent to fear that she will be “found out.” For a study of a sensitive topic to be ethical (1) the psychological and social risks must have been identified, (2) the benefits of answering the research question must offset the potential risks, (3) the prospective subjects must be informed of the risks, and (4) promises of confidentiality must be maintained.10
In deciding whether to keep responses confidential or to collect anonymous data you may first consider subject anonymity, that is, ensuring that no records are kept on the identity of subjects, and data cannot be traced back to a specific individual. If you are conducting a study for an agency, an approach that approximates subject anonymity is for agency staff to select the sample and distribute questionnaires or collect information from agency files. The staff should delete any identifying information from the records before allowing you to examine the records.
Often, however, anonymity is impossible. You must know subjects’ names to follow up on nonrespondents or to compare respondents and nonrespondents; to combine information provided by a subject with information from agency records; or to collect information from an individual at different points in time. Subject names must be available for research supervisors or auditors to verify that the research was done and that accurate information was collected and reported. Although research auditing can raise concerns about confidentiality, without audits, incompetence or malfeasance may go undetected.
Confidentiality may also be breached by carelessness, legal demands, or statistical disclosure. To avoid accidental disclosure of personal information, identifying information should be separated from an individual’s data. A code or alias can be assigned to each record and the list of names and codes stored separately in a secure place. For longitudinal studies and other studies that collect information from more than one source, aliases can be used to identify individuals and combine the individual information. A respondent may choose her own alias, that is, information that others cannot easily obtain, such as her mother’s birth date. The success of having respondents choose their alias depends on their ability to remember each time they are asked.
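One way to picture the separation of identifying information from an individual's data is the minimal sketch below. It is an illustration, not a prescribed procedure: the function name, the field names, and the sample records are all hypothetical, and the random codes stand in for whatever coding scheme a study actually adopts.

```python
import secrets

def split_identifiers(records, id_fields=("name", "phone")):
    """Split records into a key list (code -> identifiers) and a de-identified file."""
    key_list = {}       # code -> identifying fields; store separately in a secure place
    deidentified = []   # analysis file: code plus responses only
    for record in records:
        code = secrets.token_hex(4)   # random code, not derived from the name
        while code in key_list:       # regenerate in the unlikely event of a repeat
            code = secrets.token_hex(4)
        key_list[code] = {field: record[field] for field in id_fields}
        responses = {k: v for k, v in record.items() if k not in id_fields}
        deidentified.append({"code": code, **responses})
    return key_list, deidentified

# Hypothetical survey records, invented for illustration.
records = [
    {"name": "A. Rivera", "phone": "555-0101", "q1": 4, "q2": "yes"},
    {"name": "B. Chen", "phone": "555-0102", "q1": 2, "q2": "no"},
]
key_list, deidentified = split_identifiers(records)
```

Analysts would work only with the de-identified file; the key list is consulted only when records must be linked, for example to follow up with nonrespondents or to merge a second wave of data.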
Theoretically, researchers can have their records subpoenaed. Federal policies offer some protections to participants and researchers. The Confidential Information Protection and Statistical Efficiency Act (CIPSEA) requires that federal statistical agencies inform respondents and get their informed consent if the information is to be used for nonstatistical purposes.11 Certificates of confidentiality by and large protect researchers investigating sensitive subjects, such as mental illness or drug abuse, from having to provide identifying information. The U.S. Department of Health and Human Services offers certificates to researchers whose subjects might be harmed if they were identified. The Department of Justice offers certificates to researchers conducting criminal justice research. However, a certificate cannot be obtained after data collection is completed.12
When reporting data or sharing research records, you should be sensitive to unintended disclosures. For example, when reporting on a case study you should consider using pseudonyms and altering some personal information, such as occupation. Before any research records leave your control, you should identify potential breaches of informed consent and confidentiality. A subject may have agreed to participate for a specific purpose, without giving blanket authorization for other uses. As part of obtaining informed consent you should identify anticipated future uses of the data, including their availability for independent verification of the study’s implementation and replication of its analysis.13 Before releasing data, make sure that identifiers such as names, addresses, and telephone numbers have been removed.
A more formidable problem is that of deductive disclosure. If the names of participants in a study are known, someone could sort through the data to identify a specific person. With a list of respondents, someone may sort the data by age, race, sex, and position and deduce a respondent’s identity. One way to protect against such abuses is to not disclose the list of respondents. Deductive disclosure is a concern with publicly available electronic databases. For example, if census data were released as reported, one could learn detailed information about the only three Hispanic families in a county or a state’s six female-owned utilities companies. To prevent such abuses the Census Bureau has developed procedures to release as much information as possible without violating respondents’ privacy. As is true with many of the Census Bureau’s statistical practices these procedures can serve as a model.
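Before releasing a data set you can run a simple check for deductive disclosure by counting how many records share each combination of characteristics. The sketch below assumes hypothetical field names, invented records, and an arbitrary threshold of five; it does not reproduce the Census Bureau's procedures, which are more elaborate.

```python
from collections import Counter

def small_cells(rows, keys, threshold=5):
    """Return combinations of key values shared by fewer than `threshold` rows."""
    counts = Counter(tuple(row[k] for k in keys) for row in rows)
    return {combo: n for combo, n in counts.items() if n < threshold}

# Hypothetical respondent records, invented for illustration.
rows = (
    [{"age_group": "30-39", "sex": "F", "position": "analyst"}] * 6
    + [{"age_group": "50-59", "sex": "M", "position": "director"}]
)

flagged = small_cells(rows, keys=("age_group", "sex", "position"))
print(flagged)
```

Here the lone director in his fifties is flagged, because anyone who knows he took part could pick out his responses. Combinations that are flagged can be merged into broader categories or suppressed before the data are shared.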
An emerging concern is the ethical implications of Internet research. Research might be done on Web sites, electronic bulletin boards, Listservs, or social networking sites. It may be naïve to assume that any communication sent out into cyberspace is private, and there is no consensus about what online communications should be considered public as opposed to private. Information found on a Web site may be treated the same as other textual material, that is, you would not normally seek informed consent or be concerned about privacy.
FEDERAL POLICY ON PROTECTION OF HUMAN SUBJECTS AND INSTITUTIONAL REVIEW BOARDS
In 1991 a uniform federal policy for the protection of human subjects, the Common Rule, was published. Its hallmark was a requirement that every institution receiving federal money for research involving human subjects create an Institutional Review Board (IRB) and appoint its members. The IRB determines if a proposed project meets the following requirements: (1) risks to subjects are minimized, (2) risks are reasonable in relation to anticipated benefits, (3) selection of subjects is equitable, (4) informed consent will be sought and appropriately documented, (5) appropriate data will be monitored to ensure the safety of subjects, and (6) adequate provisions exist for ensuring privacy of subjects and confidentiality of data.14 Two of these criteria merit further mention. First, the long-range effects of the knowledge gained from the research are explicitly excluded in determining the risks and benefits of participation. Second, the need to address possible abuses of vulnerable populations is stressed. IRBs are encouraged to consider whether research on a specific population is consistent with an equitable selection of subjects and to make sure that these populations’ vulnerability “to coercion or undue influence” has not been exploited.
An IRB reviews all research under the purview of the institution that involves human subjects. To review only publicly or privately supported research would imply that only funded projects have to conform to ethical practices. An institution that fails to comply with the Common Rule can have its federal funding terminated or suspended. Some university IRBs have been accused of being overly cautious and throwing unnecessary roadblocks in the way of research that includes surveys or field studies, to avoid potentially harmful consequences.15
To what extent do you have to concern yourself with IRB review? Federal policies protecting human subjects require compliance by federal agencies, institutions, and individual researchers. If you are a student or an employee of a university, a medical facility, or other institution that receives federal research funds and you will be conducting a study that involves interacting with people, asking them to do something, or using identifiable private information, consult with your IRB before you start your research. Putting off learning about your institution’s IRB procedures can lead to long delays and frustration while conducting your study.
The IRB chair or the chair’s designee will determine if your project is exempt from further review, appropriate for an expedited review, or must be subject to full review. Projects exempt from review or eligible for an expedited review receive less close scrutiny. The categories of exempt projects or expedited review allow research involving minimal risk to proceed and avoid the long delays associated with a full IRB review. Minimal risk applies to those projects where the risks of participating in the research are similar to the risks encountered in daily life. For example, an IRB may waive written documentation of informed consent for surveys and observation of public behaviors. Waivers, however, are not permitted if responses could be traced to a specific individual and if disclosure could result in civil or criminal liability or damage subjects’ financial standing, employability, or reputation. The federal regulations are more detailed than what has been presented here. Furthermore, the regulations receive frequent scrutiny, and the practices are still evolving. Members of an IRB or the National Institutes of Health Office of Human Subjects Research Web site (http://ohsr.od.nih.gov) are good sources for detailed or up-to-date information.
BEYOND INFORMED CONSENT AND CONFIDENTIALITY: ISSUES OF INTEREST TO ADMINISTRATORS
Public agencies and nonprofits that do not conduct federally funded research may not have an IRB. Still, the agencies should adhere to the values and practices identified by the Belmont Report. Recall that these values and practices were:
Respect for persons requires informed consent;
Beneficence requires that benefits outweigh risks;
Justice requires a fair selection of subjects.
These values cut across professions, disciplines, and organizations. Whether you are an investigator or part of management you should ascertain that studies intended to describe groups of persons adhere to these principles and practices. The practices specifically apply to research efforts; they do not necessarily apply if the data are collected for administrative or clinical records. For example, ethical research practices do not require that supervisors get informed consent from an employee to evaluate her or demonstrate that the benefit of preparing and conducting a performance evaluation justifies its cost.
You and others involved in deciding whether to implement a study may ask the same questions that are asked to obtain informed consent. There are no right or wrong answers, but the answers will help you decide if the informed consent content and procedures are adequate, if the risk of participation is outweighed by the benefits, and if selection of participants is fair and unbiased. Key questions include the following: Will subjects be anonymous or confidential? What steps will be taken to protect their identities? What will they be asked to do? How much time will it take? Are there other potential risks? How are subjects going to be recruited and selected? What information will potential subjects be given to obtain informed consent? How will informed consent be obtained?
You should consider how the proposed research may affect the agency’s reputation. No matter what the agency’s role, its reputation may be enhanced by research that others consider valuable and harmed by research that others consider worthless, intrusive, or harmful. You may question whether a planned study will unduly infringe on respondents’ privacy or abuse their time. You should decide if a study requiring agency resources, including time, represents a good use of money or donor contributions. You should be convinced that a study will likely yield valued information. You should decide if assumptions that others will act on the findings are realistic.
Unreliable items, unwanted items, or unused studies waste respondents’ time. Items not operationally valid may abuse respondents’ goodwill, insofar as their responses contribute to incorrect or misleading conclusions. The detrimental effects of an unwanted or poorly designed study go beyond the respondents. Future studies also are affected if potential subjects become cynical about research and the value of their participation.
Seeking too much information or seeking it too often can build resistance to future requests for information. Consider the complaints of businesses and state and local governments that churn out data only to meet federal information requirements or the frustration of nonprofits that have to produce frequent reports to assure funders that their money is being well spent. If individual respondents perceive that the data are merely collected but not used, they may not only complain but also refuse requests for information.
The agency is responsible for protecting the privacy of its employees or clients. Potential subjects should be contacted by an agency representative, possibly by someone from the unit sponsoring the study. The agency representative should explain the nature of the study, get permission to give the potential subjects’ contact information to investigators, and in noncoercive language indicate the value of the subjects’ responses. Clients should be clearly told that if they decline to participate they will not lose eligibility for agency services. Public announcements, such as posters, may be used to recruit employees or clients. Voluntary participation is less likely to be compromised with posted announcements than by personal solicitation.16
For employee participation to be voluntary, whether or not employees decide to participate should not affect performance ratings, pay, or similar decisions. This should be clearly communicated to the employee.17 You should not assume that assurances will overcome employee beliefs that refusal to participate will have consequences or that promises of confidentiality will not be kept. If an employee plans to conduct research as part of a graduate program, the research must be vetted by the university’s IRB. The agency administrators should also consider how an employee’s research could impact the organization and participating employees. Is the research taking time away from other tasks? How will participants interpret the topic? Do they assume that the organization is going to address a long-standing problem? Or are they anxious that a major change is in the works?
Demonstration projects bring up concerns about what will happen once the project ends. Projects that are part of a research grant may be funded for a specific period of time, but the participants’ need for services may continue. In prisons or psychiatric hospitals, participants in an experimental program may feel abandoned if the program ends once the data are collected or the funding stops. A related issue is how participation in a demonstration project will affect the clients. For example, clients enrolled in a demonstration project that provides job training may find that no employer can use their new skills or that the training has not qualified them for advanced courses.
Ideally, relevant stakeholders should review a proposal. Input from representatives of the participant community is especially valuable in identifying potential risks. For example, parolees, ex-convicts, and prison guards might review proposals for studies involving prisoners. Even apparently innocuous groups such as employees may identify unexpected concerns. Stakeholders may be helpful in deciding if potential subjects will understand the purpose of the study, what is being asked of them, and the risks involved.
You should also check that subjects will be debriefed, if appropriate, at the end of their participation. For example, subjects who are asked to perform job-related tasks may be disappointed or frustrated about their performance. A debriefing offers the researcher an opportunity to observe any negative effects of the research and to answer questions or concerns that a subject may have.
Another issue is what will become of the research documents. What will be done with the completed questionnaires, recordings, or other research materials?18 Will they be put into an archive? If so, the subjects should be told where the data will be stored and how individual identities will be protected. If the data are being collected by a consultant, will the agency receive completed questionnaires, spreadsheets, or electronic databases? Will the consultant remove individual identifiers? Potential subjects need to know who will have access to their information before they can give informed consent.
Ethical values affect us when we gather information and when we report our findings. The ethical issues associated with reporting are covered in Chapter 14. The following cover the major values that should guide our behavior as researchers as we prepare to conduct research.
Honesty: Strive for truthfulness in all scientific communications. Honestly report data, results, methods and procedures, and publication status.
Integrity: Fulfill promises and agreements. Act in good faith.
Confidentiality: Protect the communications of participants. Do not disclose personnel records, trade or military secrets, and patient records.
Nondiscrimination: Do not provide preferential treatment to participants on the basis of sex, race, ethnicity, or other factors that are not related to their scientific competence and integrity.
Competence: Understand and do not exceed your own professional capacity and limits.
Legality: Know and obey relevant laws and institutional and governmental policies.
Human subjects protection: When conducting research on human subjects, minimize harms and risks and maximize benefits; respect human dignity, privacy, and autonomy; take special precautions with vulnerable populations; and strive to distribute the benefits and burdens of research fairly.
Recognizing and protecting research subjects is imperative whether you are an investigator or a manager. In the event that federally funded research is being conducted, an IRB review is required. This process helps to make sure that you have appropriately identified risks and benefits, communicated them to potential subjects, and have adequate procedures to obtain informed consent. Although you may not have an institutional review board to guide your research, you must follow ethical practices in administering surveys, conducting interviews, or collecting data from records.
We may erroneously assume that research that does not cause physical harm does not have a negative effect. Wasting subjects’ time is harmful; so is invading subjects’ privacy. Respect for persons, beneficence, and justice may seem like lofty ideals, but as you apply these concepts in practice you may identify and address potential problems. You will also save your agency from possibly wasting resources, diminishing its reputation, and facing litigation. The bottom line is that you should protect research subjects because it is the right thing to do.
Paul Oliver, The Student’s Guide to Research Ethics (Philadelphia: McGraw Hill, 2003).
Jay Katz, Experimentation with Human Beings (New York: Russell Sage Foundation, 1972). Although 40 years old this is an excellent resource on cases that informed discussions about protecting human subjects.
James F. Childress, Eric M. Meslin, and Harold T. Shapiro, eds. Belmont Revisited: Ethical Principles for Research with Human Subjects (Washington, D.C.: Georgetown University Press, 2008). This resource carries a series of essays that cover the development of the nearly 40-year-old Belmont Report and its impact on current discussions on protecting human subjects.
To keep up to date with federal regulations, visit the Web site of the Office of Human Research Protections within the U.S. Department of Health and Human Services: http://www.hhs.gov/ohrp/
Designing Performance Measures and Monitoring Systems
You may have heard the phrases “work smarter not harder” or “results-based management.” How do managers know how hard or how smart a person or group is working or whether a program is achieving results? They begin by collecting data that measure work and accomplishment. Collecting data on what a work unit, program, or organization is doing is called performance measurement or performance monitoring. Let’s assume that you want to learn more about the performance of a food pantry. (Food pantries distribute food to families and individuals at risk of hunger.) You might want to know the following: How many pounds of fruits and vegetables does it have available for distribution? How many households does it serve? How many food pantry users eat a nutritious diet?
To answer these questions you need to apply several research skills. First, you can build on what you have learned about measurement in Chapter 2. You need reliable, operationally valid, and sensitive measures to gather data on the amount of food, the number of households, and the incidence of nutritious diets. You have to decide how often to collect and compile the data. Then you have to analyze and interpret what you find.
In this chapter we present skills that link research methodology to information that managers use daily. Specifically, we consider research tools to create and implement programs and monitor their performance by focusing on three topics that apply to performance monitoring and other quantitative studies. First, we present logic models, which play a role in designing, monitoring, and evaluating programs. The models link a program’s activities to its achievements and help you select operationally valid measures to track them. Second, we identify data sources commonly used to measure performance. Third, we discuss time series, a research design to present and interpret data over regular intervals. This chapter, however, does not provide a comprehensive introduction to performance measurement and monitoring systems; instead it lists resources that cover these topics in depth, in the Recommended Resources at the end of the chapter.
AN OVERVIEW OF PERFORMANCE MEASUREMENT AND MONITORING
A performance-monitoring system requires data. In deciding what data to include, remember that just collecting data may not provide useful information. Good managers want information that enables them to understand how their programs work, to communicate this to others, and to take corrective action if necessary. To create useful information you must (1) decide what to measure and when and how to measure it and (2) select statistical tools that effectively present the data and facilitate comparisons over time. As a strategic manager you can use performance information to help assign staff, develop an annual budget, or assess the effectiveness of a program. You can tell funders how their money was spent and what good it did. You may rely on the data to create an annual report, which shows what was done with what resources and for what recipients, and how these figures compare with those of previous years.
You may produce and monitor performance data to identify how often services are provided, the number of people served, the cost of providing the services, and the benefits generated. You may want to review information at regular intervals so you can monitor changes in how an organization or program uses its resources, responds to demands for service, and makes progress toward achieving its goals. For example, the manager of a food pantry may want to know: how the number of clients and their characteristics vary from one month to the next over the course of a year; if the availability of nutritious food has improved the diet of pantry users; or if the number of food-insecure families has changed over time. He may also want to know the average cost of providing one family with food and how this cost has changed over time.
All too often, programs are developed because someone has what sounded like a good idea. One such good idea, having children wear school uniforms, was advocated to decrease troublesome behavior. However, school uniforms have little or no effect on children’s behavior.1 Logic models may decrease the number of similar “good” ideas that waste public monies. As you build a model you and others should answer the question “why should this idea (a program or strategy) have a positive impact on a particular social condition?” Existing research may provide the answer, but often you have to rely on logic and thoughtful debate. Relying on existing research, logic, and debate, a logic model requires participants to explain why specific activities are likely to produce a particular effect and why certain resources are needed. A logic model may help you propose effective programs, formulate a strategy to implement them, know what resources you need to amass and deploy, and select performance measures to provide feedback and demonstrate accountability. In our own experience logic models are useful from the time an organization plans a program through to its evaluation. Logic models are valuable in deciding what an organization should measure and monitor.
Simply stated, a logic model links the components consisting of a program’s resources, activities, and service units to an expected outcome. To build a logic model you should work with a management team and other relevant stakeholders. Start with the organization’s mission statement and identify how the program will help the organization achieve its mission. Typically, you should articulate what type of change you expect a program to produce. It may be a change in program participants, agency operations, or the community. You next identify the activities that must take place for the anticipated change to occur. You then identify the resources needed to conduct the activities as planned. Depending on why the model is being built you may set annual goals and estimate how long it will take to reach the long-term outcome. Figure 4.1 presents a general outline for a logic model and labels its components.
The logic model components are identified by conventional terms used to describe programs. Understanding these terms may help you to establish the operational validity of your measures. (Recall from Chapter 2 that operational validity requires that you ask if you are measuring what you intended to measure.) The concept of operational validity is important as you think about what constitutes an activity, an output, or an outcome. The common definitions of the related terms are
Inputs —resources needed to operate the program. These include staff, supplies, facilities, equipment, and materials and the annual dollar amounts necessary to pay for or purchase them.
Activities —what the program does. These may include marketing the program, recruiting clients, conducting individual or group meetings with clients, holding workshops for staff, and sponsoring conferences.
Outputs —units of service or products. These include the number of clients and the number of hours devoted to providing services. The outputs, produced by activities, are expected to lead to outcomes.
Outcomes —events, occurrences, or conditions representing an impact that the program has on the community outside of the organization. Outcomes can be immediate, intermediate, long term or ultimate.
To see how you can apply a logic model to a specific program, consider a program to enable low-income families to obtain affordable, adequate housing. Its long-term goals are to increase the number of low-income families in adequate, affordable housing and to reduce the number of families who are homeless or living in substandard housing. Figure 4.2 shows a logic model for the program. Moving from the left to the right, think of each component as providing the condition for the next component. For example, what resources are necessary to operate the program? If you have these resources you can carry out the activities needed to meet the expected demand for services; if you provide the necessary services to families, participants will benefit by finding adequate housing.
FIGURE 4.1 The General Structure of a Logic Model
FIGURE 4.2 A Logic Model Applied to a Housing Program
Consider the program in Figure 4.2. To monitor activities and to assess how well the program is doing you would collect information on some components. A performance-monitoring system does not try to measure everything. You need to balance the benefits of obtaining information against the time and costs of doing so. Trying to obtain data on too many components may take too much time, drive up costs, and limit the quality of the data. Delivering a service typically requires several activities. For example, the housing program must recruit, notify, enroll, and counsel clients. You, along with a management team, must decide which activities are important to track to improve program operations and to help further the organization’s mission.
In creating a logic model you may want to distinguish among short-term, intermediate, and long-term outcomes. Short-term and intermediate outcomes are those that should lead to a desired end but are not ends in themselves. Response times for medical teams are short-term outcomes expected to lead to the long-term outcome of reduced preventable fatalities. Other examples of long-term outcomes are an increase in the number of low-income families living in safe, affordable housing, a decline in juvenile crime, and a decrease in the incidence of disease.
Service quality is considered an intermediate outcome. Quality measures recipient perceptions about how well a service was delivered. Consider a program intended to prepare students to advance into leadership positions. Evidence of their successes may not be available soon enough to be measured and linked to the program. So students’ perceptions about the quality of the program and how well they believe it will help them are treated as intermediate outcomes. Measuring these perceptions will help the program managers gauge if the program is working.
Three terms often associated with performance measurement are indicators, workload, and equity. Indicators are indirect measures of a condition or characteristic. For example, housing quality may be an indicator of the family income of the residents, but it does not measure family income directly. Workload refers either to activities or to outputs, such as the number of applications processed or the number of clients counseled, respectively. Workload data are valuable for planning, budgeting, and performance measurement. You should also be concerned about the clients receiving a program’s services and eligible clients who are not. Uncovering the reasons for not receiving services provides valuable information. Equity refers to the fairness in providing program services. A widely held contemporary value is that social services should be available to all eligible persons without regard to race, ethnic group, religion, gender, or sexual orientation. For example, to demonstrate equity a food pantry may track the number of recipients by racial or ethnic group. You may want to go beyond showing that program services are widely available and used. Categorizing a program’s outputs and outcomes by salient population characteristics may identify equity problems. The food pantry, for example, should be concerned if 50 percent of its recipients are Hispanic but only 10 percent of its recipients who adopt a nutritious diet are Hispanic.
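The food pantry comparison can be made routine by computing each group's share of outputs and of outcomes and flagging large gaps. The following is a minimal sketch, not a prescribed method: the counts are hypothetical (chosen only to match the 50 percent and 10 percent figures in the example), and the function names and the 10-percentage-point threshold are assumptions made for illustration.

```python
def shares(counts):
    """Convert raw counts per group into each group's share of the total."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("counts must sum to a positive number")
    return {group: n / total for group, n in counts.items()}


def equity_gaps(output_counts, outcome_counts, threshold=0.10):
    """Return groups whose share of outcomes trails their share of outputs.

    The threshold (0.10 = 10 percentage points) is an illustrative assumption.
    """
    output_share = shares(output_counts)
    outcome_share = shares(outcome_counts)
    gaps = {}
    for group, share in output_share.items():
        gap = share - outcome_share.get(group, 0.0)
        if gap > threshold:
            gaps[group] = gap
    return gaps


# Hypothetical counts, chosen to mirror the example in the text.
recipients = {"Hispanic": 200, "Other": 200}          # output: food recipients
adopted_diet = {"Hispanic": 10, "Other": 90}          # outcome: adopted a nutritious diet

gaps = equity_gaps(recipients, adopted_diet)
for group, gap in gaps.items():
    print(f"{group}: outcome share trails output share by {gap:.0%}")
```

A flagged group is a prompt for investigation, not proof of unfair treatment; the reasons for the gap still have to be uncovered.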
Performance measures by themselves cannot confirm that a program was successful. Outcomes may improve because of other changes in the environment. For example, an increase in employment may be due to a better local economy, not the effect of job-training programs. Performance measures seldom provide answers to what should be done to improve the outcomes, for example, if nutritious diets are not adopted, if crime increases, or if disease spreads. Performance measures can, however, alert managers to problems and generate important discussions about what is causing disappointing outcomes.
As a manager you will want to record on a regular basis the amount of resources used. You will probably also want to measure how well the program is using its allocated resources. For a food pantry, for example, you could record the total amount of food distributed each day. You could also determine the average cost of the food provided to each individual or family. As manager of a housing program you could measure the total cost of providing counseling to clients and determine the average cost per client. Since much of the cost of providing counseling is the time spent by counselors you might decide to determine the number of hours counselors spend per client. To assess your costs or time in providing services you would want to see how you are doing compared to earlier times or other programs. Typically managers compare their program to similar programs or to their program’s previous performance.
Common performance measures for how well organizations or programs use resources are efficiency and productivity. Efficiency relates the amount of input to the amount of output or outcome. For example, the average cost (an input) of providing food to a family (an output) would be an efficiency measure. The average cost of improving a family’s nutritional status (an outcome) would also be an efficiency measure. Productivity, on the other hand, is the ratio of the amount of output (or outcome) to the amount of input. The average number of clients counseled by each counselor per month would be a productivity measure. Efficiency and productivity have traditionally linked costs and outputs. A performance-monitoring system that provides data on outcomes gives a more useful picture of efficiency and productivity. Output-to-input ratios can be affected by increasing output at the expense of results (outcomes) and the quality of service. Consider job placement services. A service may report high efficiency because it counsels a large number of clients (output) but at the same time it may place people in low-paying, dead-end jobs (an outcome).
Efficiency and productivity are often used as indicators of performance. In general we accept higher efficiency and productivity as indicators of better performance. However, we may not be able to judge performance unless we can compare these to similar measures in other organizations or to our own in previous months and years.
Here are sample data that a housing program can use to measure and monitor performance. (Note how they link to the program logic model.) We show here how the program manager can use these data to measure efficiency and productivity.
Inputs
a. Cost of counseling clients in a 3-month period: $45,000
b. Number of counselor hours devoted to housing program in a 3-month period: 768
Outputs
c. Number of clients counseled in the housing program in a 3-month period: 203
d. Number of clients who completed a credit counseling program: 47
Outcomes
e. Number of persons who obtained affordable, standard housing: 68
f. Number of persons whose credit scores increased after receiving credit counseling: 38
Output-Related Efficiency Measures
Cost of counseling clients/number of clients counseled
= $45,000/203 = $221.67 per client counseled
Time spent counseling clients/number of clients counseled
= 768/203 = 3.8 hours per client counseled
Outcome-Related Efficiency Measure
Cost of counseling clients/number who obtained housing
= $45,000/68 = $661.76 per person obtaining housing
Outcome-Related Productivity Measure
Number who obtained housing/cost of providing program
= 68/$45,000 = 0.00151, or 1.51 clients obtained housing for each $1,000 of program cost
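The ratios above can be reproduced in a few lines of Python, which is convenient when the same measures are recomputed every 3 months. This is a minimal sketch: the figures are the sample data listed above, while the variable names and the `ratio` helper are our own illustrative choices.

```python
# Figures copied from the sample housing-program data (3-month period).
counseling_cost = 45_000     # a. cost of counseling clients, in dollars (input)
counselor_hours = 768        # b. counselor hours devoted to the program (input)
clients_counseled = 203      # c. clients counseled (output)
obtained_housing = 68        # e. persons who obtained housing (outcome)


def ratio(numerator, denominator):
    """Divide numerator by denominator, rejecting a zero denominator."""
    if denominator == 0:
        raise ValueError("denominator must be nonzero")
    return numerator / denominator


# Efficiency: amount of input per unit of output or outcome.
cost_per_client = ratio(counseling_cost, clients_counseled)
hours_per_client = ratio(counselor_hours, clients_counseled)
cost_per_housed = ratio(counseling_cost, obtained_housing)

# Productivity: amount of outcome per unit of input, scaled to $1,000 of cost.
housed_per_1000_dollars = ratio(obtained_housing, counseling_cost) * 1000

print(f"Cost per client counseled: ${cost_per_client:,.2f}")
print(f"Hours per client counseled: {hours_per_client:.1f}")
print(f"Cost per person obtaining housing: ${cost_per_housed:,.2f}")
print(f"Persons housed per $1,000 of cost: {housed_per_1000_dollars:.2f}")
```

Note that efficiency and productivity computed this way are reciprocals of each other, so they always carry the same information; which one to report depends on what the audience finds easier to read.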
In addition to monitoring the amount of resources and the number of clients counseled, managers develop indicators of how well their programs convert resources into outputs or outcomes. Managers might track in 3-month intervals the unit cost of counseling clients and the unit cost of counseling clients who obtained housing. Together these indicators allow managers to monitor whether they are using more or less resources than planned and if the average cost of serving each client is changing.
Because performance monitoring requires regular data collection you must pay attention to data quality. You want to make sure that the measures are reliable or you risk incorrectly comparing two time periods or two programs. Also, you can’t ignore the cost of collecting, compiling, storing, and analyzing data. You may have to rely on other staff members to collect the data. They should be convinced that the data are worthwhile; otherwise they may report incomplete or erroneous data. In the following section, we identify five common sources of data to monitor performance. For the most part each data source has the added advantage of having a value beyond just monitoring performance.
Organizations and programs routinely keep data on customers and clients. The records contain information on inputs (e.g., costs), outputs (e.g., services provided), and key client demographics (e.g., age, gender, ethnicity). The records may contain information from staff members who submit regular performance reports, for example, the number of clients served, the types of services offered, and whether particular clients will continue to receive services.
Agency records have the advantages of availability and low cost. However, the information may not be easily converted into performance measures; the records may contain little or no information on quality; they may not indicate when the service was delivered; and a program manager’s access may be limited because of confidentiality.
User surveys, focus groups, and interviews are all used to learn about customers’ experiences with a program and their level of satisfaction. A user survey contacts samples of customers or clients. Respondents can provide data to measure outputs (services received), outcomes, and satisfaction level. Households in a community may be surveyed to identify potential or actual demand, which allows stakeholders to assess whether a program’s outputs and outcomes are adequate.
Trained Observers and Rating Systems
Trained observers may use rating systems to assess outcomes that are more qualitative. For example, a community revitalization program may create rating scales and have raters assess the cleanliness of streets, the appearance of yards, and the physical condition of residences.
The U.S. Census Bureau records provide useful information on community characteristics that can be used for performance information. The decennial census (completed every ten years) includes demographic information on the following: ethnicity, age, and gender; the number of homeowners and renters; and household size. More current and extensive information is included in the American Community Survey, conducted every year.2 Each state has a Data Center, created as a partnership between the Census Bureau and the state government, which provides users with access and education on census data and other statistical resources.3
State and local governments are also potential sources of data. Counties record information such as births, deaths, and chronic diseases for their jurisdictions and pass this information on to the state. State government Web sites typically describe various sources of state and local statistical data and have links to these sources. The data available include existing and projected population numbers, other demographic information, housing, maps, and much more.4
Regular data collection and tracking are key characteristics of performance measurement. The more frequently data are gathered the higher the cost. This is especially true of surveys or rating systems which can be quite costly each time the data are gathered. The drive for accountability and performance information for budget planning seems to favor annual reporting. More frequent reporting may occur if a measure lends itself to frequent review and adjustment, that is, as part of continual program improvement.
Quantitative information collected regularly over a long time constitutes a time series. The data can depict both short-term changes and long-term trends in a variable. Most people are familiar with time series that regularly report on some aspect of the nation’s economy or a social condition. The unemployment rate, crime rate, number of people below the poverty level, and number of new homes sold are examples. Time series are suited for situations when you want to
■ describe changes over time;
■ monitor trends;
■ forecast future trends;
■ estimate the impact of a program or an intervention;
■ establish a baseline for future comparisons.
Housing program managers will probably want to track the number of clients that are enrolled or seen each month, each quarter—every three months—and every year. Since goals of the organization include reducing the number of homeless families and the number living in substandard housing, program managers will want regular information on these indicators for the community to see if they have changed. With this information managers document changes in the need for services and plan for future services. Time series data presented in graphs and tables are generally easy to interpret and are useful in routine decision making. For example, managers of the housing program may want information on the number of clients enrolled and served and the length of service time to help them allocate resources. A time series might include an explicit independent variable. For example, housing office staff might analyze a time series of the number of people in need of adequate housing at various times before and after an event such as a major area employer going out of business. The employer’s closing would represent the independent variable and the number of people needing housing the dependent variable.
FIGURE 4.3 National Unemployment Rate: 1980–2009
The data are typically presented in a graph with time along the horizontal axis (X axis) and a variable’s values along the vertical axis (Y axis). You should focus on the variable and its changes or variations over time. Figure 4.3 shows the annual unemployment rate over nearly 30 years. This example reports data for a sample of U.S. households. You could develop a similar graph for a single state, county, or city sample and for different time periods.
Time series are useful to show the changes in a variable. As you look over the graph in Figure 4.3, look for patterns—dips and rises—in changes in the unemployment rate. Four types of changes or variations occur within a time series:
1. Long-term trends. General movement of a variable, either up or down, usually for 5 or more years.
2. Cyclical variations. Regular up-and-down changes in a variable that occur within a long-term trend; one- to five-year cycles are common although longer cycles are often observed.5
3. Seasonal variations. Fluctuations traceable to seasonally related phenomena such as holidays, weather, or the beginning of the school year.
4. Irregular (or random) fluctuations. Changes that cannot be attributed to long-term trends, cyclical variations, or seasonal variations.
The example in Figure 4.3 demonstrates a long-term trend, with cyclical variations and irregular fluctuations, which we discuss in the following section. Since the data are annual you will not be able to identify seasonal variations in this series.
Between 1980 and 2006, the unemployment rate trended downward. However, within that downward trend we see several increases and decreases. In 2006 it leveled off and in 2008 and 2009 increased dramatically. Variables seldom increase or decrease indefinitely. Changes in social, economic, environmental, and political conditions may lead to short- or long-term changes in a variable. For example, between 1980 and 1985, the unemployment rate increased and then decreased. It continued to decrease until 1989 when it increased for 3 years, then decreased, and continued to do so until 2000. Between 2000 and 2005 we again see an increase and decrease and then the dramatic increase in 2008 and 2009. When we started writing this text we couldn’t tell if the 2008 spike was an irregular fluctuation or if it signaled a longer period of relatively high unemployment. Now in 2010, the 2008 change at best marks a cyclic shift in direction or at worst the beginning of a long-term trend. By the time you read this you may have additional information and will be in a better position to judge.
Cyclical variations are regularly occurring fluctuations within a long-term trend that last for more than a year. A complete cycle is from “peak to peak” or “valley to valley” in a time series. Sometimes cycles are very regular. For example, the percentage of Americans voting peaks every 4 years, during presidential elections. The variations illustrated by the unemployment rate are more commonly seen than are regular cycles. Up-and-down movements take place throughout a time series but the cycles may not be of the same length and may last longer than 5 years. The unemployment data for the period 1980–2005 have two cycles of approximately 10 years each. You see long stretches when the unemployment rate decreased. Between 1980 and 2005 periods of increasing unemployment rarely lasted more than 3 years. Perhaps political interventions brought about a change. For those who use advanced statistical tools to analyze time series and use them to forecast, regular cycles are necessary. However, program managers can obtain useful information from understanding that these changes take place. Awareness of these variations can help the manager avoid incorrectly attributing changes to an action of the organization or incorrectly concluding that a long-term trend has changed.
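To make the “peak to peak” idea concrete, here is a minimal sketch that locates peaks in an annual series and measures the length of each complete cycle. The rates are hypothetical values invented for illustration; they are not the data behind Figure 4.3. The function name `find_peaks` and the simple rule it applies (a point higher than both of its neighbors) are our assumptions, and the rule will also flag minor irregular bumps in a noisier series.

```python
def find_peaks(values):
    """Return the indexes of points that are higher than both neighbors."""
    return [
        i
        for i in range(1, len(values) - 1)
        if values[i - 1] < values[i] > values[i + 1]
    ]


# Hypothetical annual rates (percent) for years numbered 1 through 12.
years = list(range(1, 13))
rates = [5.0, 6.2, 7.1, 6.4, 5.5, 4.9, 5.3, 6.0, 6.8, 7.4, 6.6, 5.8]

peak_years = [years[i] for i in find_peaks(rates)]

# A complete cycle runs from one peak to the next.
cycle_lengths = [
    later - earlier for earlier, later in zip(peak_years, peak_years[1:])
]

print("Peak years:", peak_years)
print("Cycle lengths in years:", cycle_lengths)
```

The same approach works “valley to valley” by reversing the comparison signs.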
Seasonal variations describe changes that occur within the course of a year. Data must represent time intervals that recur within a year, such as days, weeks, months, or quarters (3-month periods). The observed fluctuations occur within a single year and recur year after year. If the data in Figure 4.3 were monthly, you would have seen several fluctuations in any given year. Such changes are considered seasonal variations only if a similar pattern is seen year after year. An administrator may use seasonal information to decide how to staff facilities—how many staff to hire for city parks and for what length of time, for example, for 10, 12, or 14 weeks. A program manager administering an emergency shelter that secures temporary housing for homeless men would know that more individuals will need housing during certain times of the year. Ignorance of seasonal demands may result in erroneous decisions. Imagine the consequences if a merchant assumed that December sales marked a business upswing that would carry through the next several months.
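One simple way to check for a seasonal pattern is to average each quarter across several years and see whether the same quarter stands out every year. The sketch below assumes hypothetical quarterly counts of people housed by an emergency shelter; the numbers, the labels, and the variable names are all invented for illustration.

```python
from statistics import mean

# Hypothetical counts of people housed, by quarter (Q1..Q4), for three years.
quarterly_counts = {
    "Year 1": [310, 220, 180, 260],
    "Year 2": [330, 240, 190, 280],
    "Year 3": [350, 230, 200, 270],
}

# Average each quarter across the years to summarize the seasonal pattern.
quarter_means = [
    mean([counts[q] for counts in quarterly_counts.values()])
    for q in range(4)
]

# Identify the busiest quarter in each year (1 = Q1, ..., 4 = Q4).
peak_quarters = {
    year: counts.index(max(counts)) + 1
    for year, counts in quarterly_counts.items()
}

# A fluctuation counts as seasonal only if a similar pattern recurs each year.
recurs_every_year = len(set(peak_quarters.values())) == 1

print("Average by quarter:", quarter_means)
print("Busiest quarter each year:", peak_quarters)
print("Same busiest quarter every year:", recurs_every_year)
```

With only three years of data the check is suggestive at best; more years give more confidence that the pattern is seasonal and not an irregular fluctuation.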
Irregular fluctuations are variations not associated with long-term trends, cyclical variations, or seasonal variations. The irregular variations may be nonrandom or random. A nonrandom fluctuation results from conditions that can be identified and that explain the variation. The conditions may be inferred from records of concurrent events. For example, a news story about an emergency shelter or the occurrence of a flood or other natural disaster may explain an increase in homeless persons seeking shelter. Random movements are those that seemingly occur with no explanation that can be found. They are unpredictable and tend to be relatively minor.
To infer that a program made a difference you might compare the difference in the time series before and after the program was implemented. You might look at a time series to help determine the impact of a major event, such as a flood, an earthquake, or an employer closing. You might consider outcomes such as changes in unemployment, the number of homeless persons, and evidence of psychological depression.
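A before-and-after comparison can be sketched as the difference between the average level of the series after the program began and the average level before it. The monthly counts below are hypothetical, and the function name and the choice of a simple mean are assumptions made for illustration. A difference computed this way describes the change; it does not show that the program caused it, because trends, cycles, and seasonal variations can produce the same pattern.

```python
from statistics import mean


def before_after_change(series, start_index):
    """Return mean(after) - mean(before), splitting the series at start_index."""
    before = series[:start_index]
    after = series[start_index:]
    if not before or not after:
        raise ValueError("need at least one observation before and after")
    return mean(after) - mean(before)


# Hypothetical monthly counts of homeless persons; the program starts in month 5.
homeless_counts = [120, 125, 118, 122, 110, 104, 100, 98]
program_start = 4  # zero-based index of the first month with the program

change = before_after_change(homeless_counts, program_start)
print(f"Change in average monthly count: {change:+.2f}")
```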
Social indicators quantify social conditions and are often presented as a time series. The unemployment rate shown in Figure 4.3 tracks a nationwide social condition. The infant mortality rate shown in Figure 4.4 tracks a statewide social condition. (You should be able to identify the long-term trend and some cyclical variations in this time series.)
Leaders of nonprofit organizations and government policy makers may not be able to affect social conditions in a short time. However, the time series can alert them to serious problems, and awareness of possible changes in the time series will help them gauge if changes are likely to be long term.
FIGURE 4.4 Infant Mortality Rate in a Large State: 1993 to 2006 (Number of Deaths per 1,000 Live Births)
If government leaders in the state recording the data in Figure 4.4 were concerned about the long-term level of infant mortality, they would be relieved that it appears to be declining. However, during the time span shown the mortality rate increased before continuing its decline. When an increase is observed officials can investigate reasons for the increase and take appropriate actions.
This chapter began with a discussion of how organizations use logic models to develop programs and to provide a framework for measuring program performance. The components of the logic models identify steps in delivering service and achieving desired ends. These components help managers identify how to monitor the use of resources, organizational activities, and their results. Logic models link program planning and performance monitoring. The chapter identified the types of measures used to monitor performance and where to find data.
The chapter concluded by describing time series, a common strategy for using performance data. Time series can track the use of resources, activities, outputs, and outcomes. Understanding the variations in time series can enable you to respond to changes in a program and its environment. As we will explain in Chapter 10 a time series cannot demonstrate that a program caused observed changes in social conditions. Chapter 5 turns to a discussion of the statistics used to analyze the data collected as organizations and programs monitor performance.
Hatry, Harry, Phil Schaenman, Donald Fisk, John Hall, Louise Snyder. How Effective Are Your Community Services? Procedures for Performance Measurement, Third Edition (Washington, D.C.: The Urban Institute Press, 2007).
Urban Institute, Outcome Indicators Project. The Urban Institute and The Center for What Works have an outcome indicators project with suggested outcome indicators for 14 social programs. The information is available at http://www.urban.org/center/cnp/projects/outcomeindicators.cfm
W. K. Kellogg Foundation, Logic Model Development Guide: Using Logic Models to Bring Together Planning, Evaluation, and Action (Battle Creek, Michigan: W. K. Kellogg Foundation, 2004).
CHAPTER 4 EXERCISES
There are four separate exercises for Chapter 4. Each exercise develops your competence in interpreting and applying measurement concepts.
• Exercise 4.1 Childhood Vaccination Program focuses your attention on logic models, the components of a model, and choices in deciding what to measure.
• Exercise 4.2 Comparing Two Job Training and Placement Programs focuses on cost efficiency. This exercise requires a cost-benefit analysis of two programs’ outputs and outcomes to determine the more efficient program.
• Exercise 4.3 Variations in Unemployment requires you to explore and interpret time series data.
• Exercise 4.4 On Your Own provides an opportunity to work with these concepts on your own or in your own agency.
EXERCISE 4.1 Childhood Vaccination Program
Scenario
The Mid Valley Family Health Center is launching an aggressive antiflu campaign in the communities served by its clinics. The campaign will consist of two major activities: a publicity campaign about behaviors that can prevent the spread of flu and providing flu shots at minimal or no cost. The clinics, which serve low-income households, are located in racially and ethnically diverse communities.
Section A: Getting Started
1. Write a long-term outcome for this initiative.
2. Create a logic model for this initiative. (You may present the two activities on one model or separate models.) Be sure to include information about the inputs, activities, outputs, and outcomes for the activities.
3. Suggest three benefits of drawing a logic model.
4. Consider the plan to provide flu shots and suggest how you would measure
a. input
b. activity
c. output
d. impact
e. workload
f. program quality
g. equity.
5. Consider the plan to provide flu shots. What would you need to measure and how often would you collect the data for the following purposes?
a. To estimate how many staff you need at each site
b. To determine if program participation is equitable
c. To estimate your budget for next fiscal year
d. To include in your annual report to the board of directors
Section B: Small Group and Class Exercise
In groups of three to five prepare a presentation of the logic model to the program initiative’s management team.
1. Compare your logic models (Section A) and decide on a model for your presentation.
2. Prepare your presentation.
a. Comment on the value of the logic model to the Health Center and its staff.
b. Identify the long-term outcome and how it might be measured.
c. Suggest which components the Health Center should track
i. at least quarterly
ii. at least annually
d. For what quality indicator(s) would you suggest gathering data? Justify its (their) importance.
e. For what equity indicator(s) would you suggest gathering data? Justify its (their) importance.
Class Exercise
■ One group should make its presentation to the class.
■ One or two groups should play the role of the management team and ask questions to clarify the proposal and to identify challenges or problems with the model or with implementing the plans for monitoring performance. The “management team” should also point out strengths of the proposal.
■ All other students will act as observers and assess each group’s performance, focusing on those aspects of the presentation, questions, and answers that were particularly compelling.
EXERCISE 4.2 Comparing Two Job Training and Placement Programs
Performance information can be used to measure efficiency. The ratio of the amount of input to the amount of output is efficiency, and the ratio of the amount of output to the amount of input is productivity. For this exercise calculate the efficiency in terms of dollar costs.
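As an illustration, the two ratios defined above can be sketched in a few lines of Python. This is a minimal sketch, not a prescribed method: the function names and all dollar and client figures are hypothetical and are deliberately not the exercise data.

```python
def efficiency(input_dollars: float, units: float) -> float:
    """Efficiency as defined above: input divided by output (cost per unit)."""
    if units <= 0:
        raise ValueError("units must be positive")
    return input_dollars / units


def productivity(input_dollars: float, units: float) -> float:
    """Productivity as defined above: output divided by input (units per dollar)."""
    if input_dollars <= 0:
        raise ValueError("input_dollars must be positive")
    return units / input_dollars


# Hypothetical figures for illustration only (not the exercise data).
cost = 20_000.0        # dollars spent in the period (input)
clients_served = 100   # clients who received services (output)
clients_placed = 40    # clients who obtained a job (outcome)

output_efficiency = efficiency(cost, clients_served)    # dollars per client served
outcome_efficiency = efficiency(cost, clients_placed)   # dollars per client placed
output_productivity = productivity(cost, clients_served)

print(f"Cost per client served: ${output_efficiency:,.2f}")
print(f"Cost per client placed: ${outcome_efficiency:,.2f}")
print(f"Clients served per dollar: {output_productivity:.4f}")
```

With efficiency expressed as dollars per unit, a lower figure indicates a more cost-efficient program. The same function serves for output and outcome efficiency; only the unit being counted changes.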
1. Consider the job training and placement programs administered by two different organizations. The programs’ respective output, outcome, and input measures are given below. Calculate the output and outcome efficiency of the program for a 3-month period.
Output measures
a. Number of clients who received job training and placement services in a 3-month period (833)
Outcome measures
b. Number of persons who obtained a job (616)
Inputs
c. Number of dollars expended to provide services to clients in the job training and placement program in the 3-month period ($150,000)
2. Next let’s compare this job training program (Program A) to another program (Program B) with the same goals. Calculate the efficiency of Program B, whose output, outcome, and input measures are as follows.
Output measures
a. Number of clients who received job training and placement services in a 3-month period (150)
Outcome measures
b. Number of persons who obtained a job (68)
Inputs
c. Number of dollars expended to provide services to clients in the job training and placement program in the 3-month period ($50,000)
3. Identify the more cost-efficient program.
4. Explain how efficiency might differ if the target population for Program A were welfare mothers and the target population for Program B were graduating high school seniors.
5. How may the quality of the job affect program efficiency? Consider that jobs can range from low-skilled temporary jobs to skilled permanent jobs.
EXERCISE 4.3 Variations in Unemployment
Generally job training programs are popular during times of high unemployment. Job training programs use the unemployment rate to justify services, particularly increases in services. The following table contains the U.S. unemployment rate data from January 2006 to December 2009. How would you arrange these data in a time series? Graph these data. Consider putting months on the X axis and creating a separate line for each year.
Source: U.S. Department of Labor, Bureau of Labor Statistics, Labor Force Statistics from the Current Population Survey.
Write a brief statement describing these data as they are depicted in the graph you have created. Be sure to discuss any long-term trends, cyclical variations, seasonal variations, or irregular (or random) fluctuations. For any random fluctuations, can you suggest economic, political, or social factors that may have impacted these changes? How could the job training program use these data?
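One way to arrange monthly rates for the graph suggested above (months on the X axis, one line per year) is to group the records by year, as in the following sketch. It assumes each record is a (year, month, rate) tuple. The function name, the year labels, and the rates are placeholders for illustration only; they are not Bureau of Labor Statistics figures.

```python
from collections import defaultdict


def arrange_by_year(records):
    """Group (year, month, rate) records into {year: [Jan rate, ..., Dec rate]}.

    Months with no record are left as None so gaps remain visible.
    """
    table = defaultdict(lambda: [None] * 12)
    for year, month, rate in records:
        if not 1 <= month <= 12:
            raise ValueError(f"month out of range: {month}")
        table[year][month - 1] = rate
    return dict(sorted(table.items()))


# Placeholder values only; substitute the rates from the exercise table.
records = [
    ("Year 1", 1, 5.0),
    ("Year 1", 2, 5.2),
    ("Year 2", 1, 6.0),
    ("Year 2", 2, 5.8),
]

series = arrange_by_year(records)
for year, rates in series.items():
    print(year, rates)
```

Each list in the result holds one year's twelve monthly values in calendar order, so it can be passed to a spreadsheet or a plotting library as a separate line on the graph.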
EXERCISE 4.4 On Your Own
Identify a program at your job or internship.
1. Based on a review of program materials and staff interviews, draw a logic model. Focus on
a. long-term objectives.
b. the activities conducted to achieve objectives.
c. the input needed to achieve the objectives.
2. Review your model with a staff member and get his or her feedback as to its accuracy, clarity, and usefulness.
3. Make a recommendation for measuring program quality.
4. Make a recommendation for measuring program equity.
5. What measures does the agency currently track about the program? Why are these collected?
6. What data should be collected to demonstrate effectiveness? How often should these data be collected?
7. Use data that your agency collects on a regular basis. Create a chart, graph these data, and describe the fluctuations.
NOTES
1. E. Gentile and S. A. Imberman. “Dressed for success: Do school uniforms improve student behavior, attendance, and achievement?” University of Houston Working Paper 2009-03, posted at www.uh.edu/econpapers/RePEc/hou/wpaper/2009-03.pdf
2. http://www.census.gov/; http://www.census.gov/acs/www/
3. The State Data Center Program is described at http://www.census.gov/sdc/
4. For example, a North Carolina State Government site provides a Web resource for North Carolina data with over 1,300 data items from state and federal agencies. See http://linc.state.nc.us/
5. M. Kendall and J. K. Ord. Time Series, Third Edition (New York: Oxford University Press, 1990). Kendall and Ord identify four-year cycles. Business cycles are often between three and eight years.