Discussion 4
Evaluation Strategies
for Human Services
Programs
A Guide for Policymakers
and Providers
Adele Harrell
with
Martha Burt
Harry Hatry
Shelli Rossman
Jeffrey Roth
William Sabol
Contents
Clarifying the Evaluation Questions
Developing a Logic Model
Assessing Readiness for Evaluation
Selecting an Evaluation Design
Identifying Potential Evaluation Problems
Conclusions
EXHIBITS
Exhibit A: Logic Model Used in Evaluation of the Children At Risk Program
Exhibit B: Process for Selecting Impact Evaluation Designs
The Urban
Institute
Washington, D.C.
In the continuing effort to improve human service programs, funders, policymakers,
and service providers are increasingly recognizing the importance of rigorous program
evaluations. They want to know what the programs accomplish, what they cost, and
how they should be operated to achieve maximum cost-effectiveness. They want to
know which programs work for which groups, and they want conclusions based on
evidence, rather than testimonials and impassioned pleas.
This paper lays out, for the nontechnician, the basic principles of program evaluation
design. It signals common pitfalls, identifies constraints that need to be considered,
and presents ideas for solving potential problems. These principles are general and
can be applied to a wide range of human service programs. We illustrate these
principles here with examples from programs for vulnerable children and youth.
Evaluation of these programs is particularly challenging because they address a wide
diversity of problems and possible solutions, often include multiple agencies and
clients, and change over time to meet shifting service needs.
Steps in Selecting the Appropriate Evaluation Design. The first step in the process of
selecting an evaluation design is to clarify the questions that need to be answered. The
next step is to develop a logic model that lays out the expected causal linkages
between the program (or program components) and the program goals. Without
tracing these anticipated links it is impossible to interpret the evaluation evidence that
is collected. The third step is to review the program to assess its readiness for
evaluation. These three steps can be done at the same time or in overlapping stages.
For expositional clarity we will discuss each of them in turn. We will then describe
how to select the best design for a given purpose from among the major types of
evaluation that exist.
Clarifying the Evaluation Questions
The design of any evaluation begins by defining the audience for the evaluation
findings, what they need to know, and when. These questions determine which of the
following four major types of evaluation should be chosen:
Impact evaluations focus on questions of causality. Did the program have its
intended effects? If so, who was helped and what activities or characteristics of the
program created the impact? Did the program have any unintended consequences,
positive or negative?
Performance monitoring provides information on key aspects of how a system or
program is operating and the extent to which specified program objectives are being
attained (e.g., numbers of youth served compared to target goals, reductions in school
dropouts compared to target goals). Results are used by service providers, funders,
and policymakers to assess the program's performance and accomplishments.
Process evaluations answer questions about how the program operates and document
the procedures and activities undertaken in service delivery. Such evaluations help
identify problems faced in delivering services and strategies for overcoming these
problems. They are useful to practitioners and service providers in replicating or
adapting program strategies.
Cost evaluations address how much the program or program components cost,
preferably in relation to alternative uses of the same resources and to the benefits
being produced by the program. In the current fiscal environment, programs must
expect to defend their costs against alternative uses.
A comprehensive evaluation will include all these activities. Sometimes, however, the
questions raised, the target audience for findings, or the available resources limit the
evaluation focus to one or two of these activities.
Whether to provide preliminary evaluations to staff for use in improving program
operations and developing additional services is an issue that needs to be faced.
Preliminary results can be effectively used to identify operational problems and
develop the capacity of program staff to conduct their own ongoing evaluation and
monitoring activities.(1) But this use of evaluation findings, called formative
evaluations, presents a challenge to evaluators who are faced with the much more
difficult task of estimating the impact of an evolving intervention. When the program
itself is continuing to change, measuring impact requires ongoing measurement of the
types and level of service provided. The danger in formative evaluations is that the
line between program operations and assessment will be blurred. The extra effort and
resources required for impact analysis in formative evaluations have to be weighed
against the potential gains to the program from ongoing improvements and the greater
usefulness of the final evaluation findings.
Developing a Logic Model
It is impossible to interpret evaluation findings without a clear understanding of
program goals, implementation sequences, and the expected links between them and
expected program benefits. Expectations about these linkages are made explicit by
developing a logic model. Such a model is developed by discussing with service
providers and funders the goals of and rationales behind program organization and
content, examining planning documents and program reports, and reviewing research
findings on similar programs or problems. The literature review may be particularly
helpful in identifying plausible causal links and any factors other than the program
which should be considered in the evaluation.
The logic model provides a simplified description of the program, the
intended outputs, and the intended outcomes. Program characteristics include the
population to be reached, the resources to be used, and identification of the types and
levels of service elements. Outputs are immediate program products resulting from the
internal operations of the program, such as the delivery of planned services. Examples
of output indicators in the area of programs for vulnerable children and youth might
include the numbers of children immunized, home visits by case managers, or youth
completing a job training program. These program outputs are, in turn, the vehicle for
producing the desired program outcomes, for example, decreases in childhood
illnesses, decreases in abuse and neglect cases, or increases in youth employment.
Careful attention must be paid to when the anticipated outcome should be expected to
occur. For this reason it is often useful to divide outcomes into intermediate versus
longer term. For example, improved school attendance in early grades might be an
intermediate outcome associated with the longer-term outcome of dropout prevention.
Care must be given to focusing on outcomes which will occur within the study period.
A classic failure to select an outcome that could be expected to occur within the time
frame of the study occurred in evaluations of the DARE drug prevention program, an
educational program for fifth and sixth graders designed to prevent drug use.
Evaluation results showed no significant prevention of drug use at the end of the
program. This result should have been anticipated, since drug use does not typically
begin among youth in this country until the mid-teen years (14 to 17). An age-
appropriate intermediate outcome should have been selected as the primary outcome
measure, such as improved peer resistance skills and changes in beliefs about the risks
of drug use.
The logic model should also include explicit mapping of the conditions present in the
program environment or characteristics of the target group or community that may
affect the program's ability to achieve its goals. Non-program characteristics of the
program organization, community or target population that are likely to influence the
outputs and outcomes and/or use of program services are called antecedent
variables. Conditions or events in the program, target population, or community that
may limit or expand the extent to which program outputs actually produce the desired
outcomes are called mediating variables. For example, a drug abuse prevention
program may be less effective if the program staff are inexperienced, or if the local
community offers fewer recreational alternatives to substance abuse and/or more
active open drug markets (antecedent variables). Offering other support services in
combination with the program may enhance its impact (a mediating variable).
In impact evaluations the logic model is used to spell out how, and for whom, certain
services are expected to create specific changes/benefits. For example, if the program
includes parenting classes, the logic model will identify this activity as a key program
component and show the types of changes in parenting that will be used to measure
program outcomes (e.g., improved parental assistance with homework or more
effective communication between parents and adolescents).
In performance monitoring, the logic model is used to focus on which kinds of output
and outcome indicators are appropriate for specific target populations, communities,
or time periods. For example, among indicators of child improvement in school, one
might expect attendance to improve in the first semester of a program, but academic
test score improvement only after a significant period of program participation, with
the timing possibly varying by the age and developmental stage of the children.
In process evaluation, the logic model is used to identify expectations about how the
program should work--an "ideal type"--which can then be used to assess the deviations
in practice, why these deviations have occurred, and how the deviations may affect
program outputs. This assists program managers (and evaluators) to identify
differences (including positive and negative unintended consequences), consider
possible mechanisms for fine-tuning program operations to align the actual program
with the planned approach, or re-visit program strategies to consider alternatives.(2)
Logic models are constructed to show temporal sequences, building left to right, and
they typically diagram relationships with arrows. An example of a logic model is
shown in Exhibit A. It was developed by the Urban Institute during the planning of
the evaluation of the Children At Risk program (CAR). CAR is an intensive
intervention program designed to prevent involvement in drugs and crime, and to
foster healthy development among adolescents ages 11 to 13 who exhibit serious risk
indicators and live in severely distressed inner-city neighborhoods.
The intervention consists of eight required program components:
Case Managers employed by the program make a service plan for all members
of the household of participating youth and provide intensive follow-up on
referrals to needed services, handling a caseload of 15;
Family Services include parenting skills training for all parents, and referral to
other services as needed (intensive family counseling, stress
management/coping skills training, identification and treatment of substance
abuse, health care, job training and employment programs, housing, and
income support services);
Education Services include tutoring or homework assistance for all youth, and
referral to other services as needed (educational testing, special education
classes);
After-School and Summer Activities for all CAR youth include recreational
programs and life-skill/leadership development activities, combined with
training or education;
Mentoring is provided by local organizations for youth in need of a caring
relationship with an adult. The role of the mentor is to: (a) inform youth about
alternative available choices (e.g., activities and goals); (b) familiarize them
with strategies available for pursuing those choices; (c) provide training,
opportunities for practice, and feedback in the development of skills for
implementing particular strategies; and (d) provide relationships through which
youth are affirmed, inspired, and encouraged to make healthy choices;
Incentives such as gifts and special events are used to build morale and
attachment to the pro-social goals of the program (e.g., gift certificates, trips,
and vouchers for pizza, sports shops, movies, and stipends for community
service during summer programs);
Community Policing/Enhanced Enforcement is used in all target
neighborhoods to create safer environments with less drug activity. Law
enforcement activities include out-stationing police in schools and
neighborhood locations to maintain order and enhance relationships with
community groups;
Criminal/Juvenile Justice Intervention involves collaboration between case
managers and juvenile court personnel to provide community service
opportunities and enhanced supervision of youth in the justice system.
Exhibit A
Logic Model Used in Evaluation of the Children At Risk Program
Antecedent variables include the levels and types of neighborhood, family, peer
group, and personal risk factors for participants as well as their demographic
characteristics. These are influences that are present before the program intervention.
Mediating variables include exposure to other social or educational services,
perceptions of opportunities, and social norms. These are influences that
operate at the same time as the program is operating. The program components are
designed to achieve the intermediate outcomes--reductions in risk factors and
enhancement of protective factors at the end of program participation. These
intermediate outcomes, measured at the end of program participation, are
hypothesized to be requisite steps towards the desired longer-term outcomes--
prevention of drug use, drug selling, delinquency, school failure and dropout, and teen
parenthood.
Program outputs, not shown in this diagram, include indicators of performance such
as the number of tutoring sessions provided, number of home visits by case managers,
and number of times parents participated in program activities.
Assessing Readiness for Evaluation
Evaluability assessment is a systematic procedure for deciding whether program
evaluation is justified, feasible, and likely to provide useful information. Questions to
be considered in an evaluability assessment include: (3)
Is the program's logic model plausible given the resources available and
guidance from the relevant literature? If program goals are unrealistic or the
intervention strategies not well grounded in theory and/or prior evidence, then
evaluation is not a good investment.
What kinds of data will be needed, from what number of subjects, and what
data are likely to be already available? Evaluations should be designed to
maximize the use of available data, as long as these are valid indicators of
important concepts and are reliable. Available data may, for example, include
government statistics, individual and summary agency records and statistics,
and information collected by researchers for other studies. If there are crucial
data needs not met with existing data, resources must be available to collect the
requisite new data.
Are adequate resources and assets available--money, time, expertise, and
community and government support? Are there any factors that limit or
constrain access to these resources?
Can the evaluation be achieved in a time frame that will permit the findings to
be useful in making program and policy decisions by federal, state, and local
officials?
To what extent does evaluation information already exist somewhere on the
same or a closely related intervention? The answer to this question can have
important implications for action. Any successful previous attempts may yield
promising models for replication. Lessons learned from previous unsuccessful
attempts may inform the current effort. If sufficient evidence already exists
from previous efforts, the value of a new evaluation may be marginal.
To what extent are the findings from an evaluation likely to be generalizable to
other communities, and therefore useful in assessing whether the program
should be expanded to other settings or areas? Are there unique characteristics
of the projects to be evaluated that might not apply to most other projects?
Program characteristics that are not generalizable reduce the value of any
findings.
Selecting an Evaluation Design
Selection of the evaluation design follows the systematic consideration of these
questions. As noted, there are four major types of evaluation: impact, performance monitoring, process, and cost. We discuss each in turn.
Impact Evaluation Designs
Three designs are possible for impact evaluations: experimental, quasi-
experimental, and non-experimental. They all share the strategy of comparing
program outcomes with some measure of what would have happened without the
program. Experimental designs are the most powerful and produce the strongest
evidence. These are not always possible, however, in which case one of the two other
alternatives must be chosen. (A later section discusses how to make the choice.)
EXPERIMENTAL DESIGNS
Key elements. Experimental designs are considered the "gold standard" in impact
evaluation. Experiments require that individuals or groups, such as classrooms or
schools, be assigned at random (by the flip of a coin or equivalent randomizing
procedure) to one or more groups prior to the start of services. The "treatment" group
or groups will be designated to receive particular services designed to achieve clearly
specified outcomes. If multiple treatment groups are designated, the outcomes for the
treatment groups may be compared to one another to estimate the relative impact of
the different services or the impact relative to a control group. A "control" group
receives no services. The treatment group outcomes are compared to control group
outcomes to estimate impact. Because chance alone determines who receives the
program services, the groups can be assumed to be similar on all characteristics that
might affect the outcome measures except the program. Any differences between
treatment and control groups, therefore, can be attributed with confidence to the
impacts of the program.
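The logic of random assignment and the treatment-versus-control comparison can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a procedure taken from any evaluation described here: the function names, the even split into two groups, and the fixed seed are all hypothetical choices made for the example.

```python
import random
import statistics

def assign_at_random(ids, seed=0):
    """Shuffle eligible IDs and split them into treatment and control halves.

    The 50/50 split and the seed are illustrative assumptions.
    """
    rng = random.Random(seed)
    shuffled = list(ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

def impact_estimate(outcomes, treatment, control):
    """Mean outcome in the treatment group minus mean outcome in the control group."""
    treated_mean = statistics.mean(outcomes[i] for i in treatment)
    control_mean = statistics.mean(outcomes[i] for i in control)
    return treated_mean - control_mean
```

Because the evaluator runs the shuffle, no one's judgment about who "needs" the program enters the assignment. The simple difference in means is only the starting point; a real analysis would also report its statistical uncertainty.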
Design Variations. One design variation is based on a random selection of time
periods during which services are provided. For example, new services may be
offered on randomly chosen weeks or days. A version of this approach is to use "week
on/week off" assignment procedures. Although not truly random, this approach
closely approximates random assignment if client characteristics do not vary
systematically from week to week. It has the major advantage that program staff often
find it easier to implement than making decisions on program entry by the flip of a
coin on a case-by-case basis. A second design variation is a staggered start approach,
in which some members of the target group are randomly selected to receive services
with the understanding that the remainder will receive services at a later time (in the
case of a school or classroom, the next semester or month). One disadvantage of the
staggered start design is that the observations of outcomes are limited to the period
between the time the first group completes the program and the second group begins.
As a result, it is generally restricted to assessing gains made during participation in
relatively short-term programs.
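A "week on/week off" rule might be implemented as in the sketch below. It assumes, purely for illustration, that assignment is keyed to the ISO week number of the client's intake date and that odd-numbered weeks are the "on" weeks; the function name and parameter are hypothetical.

```python
from datetime import date

def week_on_week_off(intake_date, on_weeks_are_odd=True):
    """Return 'treatment' or 'control' from the ISO week of the intake date."""
    week = intake_date.isocalendar()[1]  # ISO week number, 1 through 53
    is_odd = week % 2 == 1
    return "treatment" if is_odd == on_weeks_are_odd else "control"
```

One caution with this particular rule: at a year boundary, week 53 can be followed by week 1, giving two "on" weeks in a row, so a running count of weeks since the program start may be a safer key.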
Limitations/Considerations. Although experiments are the preferred design for an
impact evaluation on scientific grounds, random assignment evaluations are not
always the ideal choice in real-life settings. Some interventions are inherently
impossible to study through randomized experiments. Youth curfews, for example,
cannot be enforced against a randomly selected subset of children in a community.
And "week on/week off" enforcement is likely to breed contempt for both the law and
enforcement.
A second consideration is whether random assignment is ethical and acceptable to the
community. Public opinion may resist treating similar children differently on the basis
of a coin flip or may view random assignment as exploiting vulnerable populations
and powerless people. Carefully designed procedures for randomization may be able
to overcome such resistance. One strategy is random selection of those to receive
services from a list of those who meet eligibility requirements when resources are not
available to serve everyone who is eligible. This form of drawing lots is close enough
to "first come, first served" to be accepted as fair in many situations. Providing
services for some clients at a later time (the next month or semester as described
above) may satisfy community concerns about fairness and be consistent with
available staff and resources. Sometimes, random assignment can involve relaxing a
requirement instead of adding one, which makes randomization less controversial.
Great care needs to be taken to ensure that the control group is not denied essential
services they would otherwise have, that the benefits to participants and the
community are carefully explained, and that program staff and participants understand
and support the research. Many funders require a formal review of the research design
by a panel trained in guidelines developed to protect research participants. Even when
such review is not required, explicit consideration of this issue is essential.
A third important issue is whether the results that are likely to be obtained justify the
investment. Experiments typically require high levels of resources--money, time,
expertise, and support from program staff, government agencies, funders and the
community. Evaluation planners have to ask themselves whether the answers to the
list of evaluation questions--and the decisions on program continuation, expansion, or
modification that will be made on the basis of the findings--could be based on less
costly, less definitive, but still acceptable evaluation strategies.
Practical Issues. Experimental designs run the most risk of being contaminated
because of deliberate or accidental mistakes made in the field. To minimize this
danger, there must be close collaboration between the evaluation team and the
program staff in identifying objectives, setting schedules, dividing responsibilities for
record-keeping and data collection, making decisions regarding client contact, and
sharing information on progress and problems. Active support of the key program
administrators, ongoing staff training and communication via meetings, conference
calls, or e-mail are essential.
Failure to adhere to the plan for random assignment is a common problem. Staff are
often intensely committed to their clients and will want to base program entry
decisions on their perceptions of who needs, or will benefit from, the program. To
prevent this pitfall, procedures should be set up so that the evaluator, not program
staff, is in charge of the allocation to treatment or control group. Statistical
adjustments in the analysis may be needed if there are operational failures to maintain
the randomization process.(4) And even these may be inadequate to remove the biases
thus introduced.
Another potential problem area is noncomparable information for treatment and
control group members. Program staff can readily collect data and provide contact
information for treatment group members because they have continuing contacts with
clients, other agencies, and the community. Collecting comparable data and contact
information on control group members can be difficult. If the experiment loses track
altogether of more control than treatment group members, the evaluation data will not
only be incomplete, it will provide distorted and therefore misleading information on
what impacts the program has. The best way to avoid bias from this problem (called
differential attrition) is to plan tracking procedures and data collection at the start of
the evaluation, gathering information from the control group members on how they
can be located, and developing agreements with other community agencies, preferably
in writing, for assistance in data collection and sample member tracking. These
agreements are helpful in maintaining sample continuity in the face of staff turnover
at the agencies involved.
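A simple check for differential attrition compares the share of each group lost to follow-up. The sketch below is illustrative only; the function names are hypothetical, and no threshold for an "acceptable" gap is implied.

```python
def attrition_rate(enrolled, followed_up):
    """Share of enrolled sample members who could not be followed up."""
    if enrolled <= 0:
        raise ValueError("enrolled must be positive")
    return 1 - followed_up / enrolled

def differential_attrition(treat_enrolled, treat_followed,
                           ctrl_enrolled, ctrl_followed):
    """Control-group attrition rate minus treatment-group attrition rate.

    A positive value means more control than treatment members were lost.
    """
    return (attrition_rate(ctrl_enrolled, ctrl_followed)
            - attrition_rate(treat_enrolled, treat_followed))
```

Running this check at each follow-up wave, rather than only at the end, gives the evaluation team time to step up tracking efforts for the group that is falling behind.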
If the program services and content change over time, it may be difficult to determine
what level or type of services produced the outcomes. The best strategy is to identify
key changes in the program and the timing of changes as part of a process evaluation
and use this information to define "types of program" variations in the program
experience of different participants for the impact analysis. Other potential problems
may be solvable through the use of special statistical techniques. Such problems
include insufficient or unequal follow-up periods for treatment versus control,(5) and
the risk of events (e.g., failure in school, incarceration, injury, moving) that are more
likely to remove some types of members from a sample than others before the end of
the planned follow-up period.(6)
Example. The evaluation of Project Alert, an eight-week junior high school
curriculum for teaching seventh grade students to avoid drug use, used an
experimental design.(7) Thirty California and Oregon schools were randomly
assigned to three groups: 1) students instructed by adult health educators, 2) students
instructed by older teenagers, and 3) a no-treatment control group, although four of
the non-treatment schools provided other drug prevention instructional programs. To
increase the generalizability of the findings, the schools were drawn from eight urban,
suburban, and rural communities and nearly a third of the schools had minority
populations of 50 percent or higher. To increase the pre-assignment similarities of the
three experimental groups and strengthen the statistical power of the analysis (given
the relatively small sample of schools), each community included at least one school
from each experimental group, and the schools included in the experiment were
matched to the extent possible to reduce differences among groups in such
characteristics as test scores, language spoken at home, drug use among 8th graders,
and ethnic and income composition. These procedures produced substantial pre-
experimental similarities in factors related to drug use among the experimental
groups. Since schools but not students were randomly assigned, statistical adjustments
were used to correct for the clustering of students within schools. Students completed
questionnaires about their drug use seven times between grades 7 and 12; those who
transferred to other schools or districts completed mail and telephone interviews to
minimize sample attrition. Outcome measures included cognitive risk factors
associated with drug use: beliefs about consequences of use, norms regarding drug
use, peer resistance, self-efficacy, and expected future drug use.
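Why clustering matters can be illustrated with the standard design-effect approximation, 1 + (m - 1) x ICC, where m is the number of students per school and ICC is the intraclass correlation. This sketch is not the adjustment used in the Project Alert analysis, which the text does not specify; it assumes equal-sized clusters, and the function names are hypothetical.

```python
def design_effect(cluster_size, icc):
    """Approximate variance inflation from sampling whole clusters.

    Assumes equal cluster sizes; icc is the intraclass correlation (0 to 1).
    """
    return 1 + (cluster_size - 1) * icc

def effective_sample_size(n_students, cluster_size, icc):
    """Number of independent observations the clustered sample is worth."""
    return n_students / design_effect(cluster_size, icc)
```

Even a small intraclass correlation shrinks the effective sample substantially when clusters are large, which is why the number of schools, not just the number of students, drives statistical power in school-level experiments.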
Experimental evaluations are costly. The Project Alert evaluation, for example,
cost $1.5 million. But the rigorous design permitted strong conclusions about the
long-term effectiveness of drug prevention education during early adolescence and
demonstrated that results are not restricted to middle-class communities but also
hold in schools with high proportions of lower-income and minority students.
QUASI-EXPERIMENTAL DESIGNS
Key Elements. Like experiments, quasi-experimental evaluations compare outcomes
for program participants to outcomes for comparison groups that do not receive
program services. The critical difference is that the decision on who receives the
program is not random. Comparison groups are made up of members of the target
population as similar as possible to program participants on factors that could affect
the selected outcomes to be observed. Multivariate statistical techniques are then used
to control for remaining differences between the groups.
Usually, evaluators use existing population groups for comparison--those who live in a
similar area, or are enrolled in the same school in a different classroom, or attended
the same school with the same teacher in the previous year. In some situations, staff
(or schools or communities) are willing or trained to try the new "treatment" while
others are not, but the same rules for service eligibility are used by all.
Design Variations. The primary variation is to construct a comparison group by
matching individuals outside the program to individuals in the treatment group on a
selected set of characteristics. This process for selecting a comparison group is
methodologically less defensible.(8) The threats to validity are twofold. 1) Matches
based on similarities at a
single point in time do not always result in groups of individuals who are comparable
over time. Thus, the groups may become increasingly different over time independent
of the program. 2) Differences in variables not used in the matching may have a
substantial effect independently of the program being evaluated.
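Individual matching of the kind described above can be sketched as a greedy nearest-neighbor search. The code is a simplified illustration, not a recommended procedure: the function name, the use of summed absolute differences as the distance, and matching without replacement are all assumptions made for the example, and the characteristics are assumed to be numeric and on comparable scales.

```python
def match_comparison_group(treated, pool, keys):
    """Pair each treated individual with the closest unused pool member.

    treated and pool map an ID to a dict of numeric characteristics;
    keys lists the characteristics used for matching.
    """
    available = dict(pool)
    matches = {}
    for t_id, t in treated.items():
        if not available:
            break  # pool exhausted; remaining treated members stay unmatched
        best = min(
            available,
            key=lambda c: sum(abs(t[k] - available[c][k]) for k in keys),
        )
        matches[t_id] = best
        del available[best]
    return matches
```

The sketch makes the second threat to validity concrete: any characteristic left out of `keys` plays no part in the match, so the groups can still differ on it.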
Quasi-experimental designs vary in the number and timing of the collection of data on
program outcome measures. The selection of the number and timing of measurements
is based on an assessment of the potential threats posed by competing hypotheses that
cannot be ruled out by the comparison methodology. In many situations, the strongest
designs are those that collect pre-program measures of outcomes and risk factors and
use these in the analysis to focus on within-individual changes that occur during the
program period. These variables are also used to identify groups of participants who
benefit most from the services. One design variation involves additional measurement
points (in addition to simple before and after) to measure trends more precisely.
Another variation is useful when pre-program data collection (such as administering a
test on knowledge or attitudes) might "teach" youth about the questions to be asked
after the program to measure change, and thus distort the measurement of program
impact. This variation involves limiting data collection to the end of the program
period for some groups, allowing their post-program answers to be compared with the
post-program answers of those who also participated in the pre-program testing.
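When pre-program and post-program measures are available for both groups, one common way to focus on within-individual change is to compare average gains. The sketch below assumes paired before-and-after scores for each person; the function names are hypothetical, and a full analysis would add statistical controls for remaining group differences.

```python
import statistics

def mean_gain(pre, post):
    """Average within-individual change from pre-program to post-program."""
    return statistics.mean(after - before for before, after in zip(pre, post))

def gain_difference(treat_pre, treat_post, comp_pre, comp_post):
    """Mean gain of participants minus mean gain of the comparison group."""
    return mean_gain(treat_pre, treat_post) - mean_gain(comp_pre, comp_post)
```

Subtracting the comparison group's gain removes change that would have occurred anyway, but only to the extent that the two groups would have followed similar trends without the program.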
Considerations/Limitations. Use of non-equivalent control group designs requires
careful attention to procedures that rule out competing hypotheses regarding what
caused any observed differences on the outcomes of interest. In evaluations of
programs for vulnerable children and youth, three threats to validity stand out.(9)
The first is the threat of "maturation"--the possibility that age-related processes will
contribute to outcomes independently of the program intervention. Among youth,
certain outcomes, positive and negative, are strongly tied to age--outcomes such as
drug use, delinquency, and early parenthood. It is therefore necessary to be sure that
the comparison group is made up of youth at the same developmental stage.
A second threat is that of "history"--the risk that unrelated events may affect
outcomes. For example, the rapid spread of crack use among women of childbearing age
in the United States in the late 1980s greatly increased rates of drug-exposed infants.
Thus, a comparison group for an evaluation of a prenatal health care program would
need to be drawn from the same years and communities to "control" for the spread of
crack. Otherwise, the upward trend in negative outcomes due to crack could obscure
the prevention benefits of the program. Similarly, designs need to consider controls
for geographic variation in events external to the program. For example, gang
crackdowns in some neighborhoods and not others could influence assessments of the
impact of a school-based delinquency or drug prevention program. If the crackdown
occurred in the "treatment" neighborhood, the program effects might be over-
estimated; if it occurred in the comparison neighborhood, program effects might be
under-estimated.
A third threat to validity is the process of "selection"--the factors that determine who
receives services. Some of these factors are readily identified and can be used as
control variables in statistical models, such as living in a specific school district or
meeting program eligibility criteria. However, it is unlikely that all factors will be
correctly identified and adequately measured. For example, program participants may
receive services because they are more motivated, skillful, or socially well connected
than nonparticipants. Such differences are not easy to measure during a program
evaluation.
Practical Problems. Building defenses or "controls" for threats to validity into
evaluation designs through the selection of comparison groups and the timing of
outcome observations is a challenge. Controls for maturation, history, and selection
may involve, respectively, selecting a sample that includes multiple age cohorts,
collecting data in similar or nearby localities that lack the program,(10) or applying a
statistical model that controls for foreseeable biases in selecting program
participants.(11) Even when the comparison group is carefully selected, the researcher
cannot be sure that all relevant group differences have been identified and measured
accurately. Statistical methods can adjust for such problems and increase the precision
with which program effects can be estimated,(12) but they do not fully compensate
for the non-random design. Findings need to be interpreted extremely cautiously and
untested alternative hypotheses carefully considered.
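One simple form of statistical control for maturation is to compare program and comparison youth only within the same age group and then average the within-group differences. The sketch below assumes a hypothetical record layout of (age group, program membership, 0/1 outcome) and invented data. It adjusts only for measured age and does nothing about unmeasured selection differences.

```python
from collections import defaultdict

def stratified_difference(records):
    """records: iterable of (age_group, in_program, outcome), outcome 0 or 1.
    Returns the program-minus-comparison difference in outcome rates,
    computed within each age group and averaged with weights equal to the
    number of program participants in that age group."""
    cells = defaultdict(lambda: {True: [], False: []})
    for age_group, in_program, outcome in records:
        cells[age_group][in_program].append(outcome)

    weighted_sum = 0.0
    total_weight = 0
    for groups in cells.values():
        program, comparison = groups[True], groups[False]
        if not program or not comparison:
            continue  # no within-age contrast is possible for this age group
        diff = sum(program) / len(program) - sum(comparison) / len(comparison)
        weighted_sum += diff * len(program)
        total_weight += len(program)
    if total_weight == 0:
        raise ValueError("no age group contains both program and comparison youth")
    return weighted_sum / total_weight

# Hypothetical data: outcome 1 = dropped out of school, 0 = still enrolled.
records = (
    [("13-14", True, o) for o in (0, 0, 1, 0)]
    + [("13-14", False, o) for o in (0, 1, 0, 1)]
    + [("15-16", True, o) for o in (1, 0)]
    + [("15-16", False, o) for o in (1, 1, 0, 1)]
)

adjusted = stratified_difference(records)
print(f"Age-adjusted difference in dropout rate: {adjusted:+.2f}")
```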
As in experimental evaluation, plans for quasi-experimental evaluations need to pay
close attention to the problem of collecting comparable information on control group
members and developing procedures for tracking them. However, the need for close
collaboration with program staff is reduced, since the staff are generally neither
involved in selecting participants nor in contact with comparison group members.
Example. The evaluation of the Teen Age Parenting Program (TAPP) for adolescents
divided teen mothers into three groups designed to be similar in age and other
characteristics.(13) Each group was evenly divided among black, Hispanic, and white
participants. One group attended an alternative school with child development and
parenting classes and a nursery school featuring a parenting-child development
curriculum. Another group attended an alternative school without a nursery school.
The remaining group received no special services for teenage parents. Services began
during pregnancy. Assessments of educational progress, fertility, knowledge, and
child development two to four years later were based on interviews and school
records. Mothers in the alternative school with the nursery program had completed
more schooling and were more likely to still be enrolled in school than the other
mothers. Mothers in both alternative schools had more knowledge about parenting and
reproduction and more positive attitudes about parenting than those without special
services. But there were no significant differences in the groups on child development
outcome measures. How to interpret this seeming inconsistency is complicated,
because the evaluation design did not have pre-program measures of individual
differences and assignment was not random. The education and knowledge
differences across the three groups may have been there from the beginning, rather than being attributable to the special services.
NON-EXPERIMENTAL IMPACT EVALUATIONS
Key Elements. Non-experimental impact evaluations examine changes in levels of risk
or outcomes among program participants, or groups including program participants,
but do not include comparison groups of other individuals or groups not exposed to
the program.
Design Variations. The four primary types of non-experimental designs include: 1)
before and after comparisons of program participants; 2) time series designs based on
repeated measures of outcomes before and after the program for groups that include
program participants; 3) panel studies based on repeated measurement of outcomes on
the same group of participants; and 4) post-program comparisons among groups of
participants.
The first two designs are based on analysis of aggregate data. In before and after
comparisons, outcomes for groups of participants (program groups that enter the
program at a specific time and progress through it over the same time frame) are
measured before and after an intervention and an assessment of impact inferred from
the differences. This simple design is often used to assess whether knowledge,
attitudes, or behavior of the group changed after exposure to a classroom curriculum
or job training program. Time series designs are an extension of the before and after
design that uses multiple measures of the outcome variables before an intervention
begins and continues to take multiple measures after the intervention is in place. If a
change in the trend (direction or level) in the outcome occurs at, or shortly after, the
time of the intervention, the significance of the observed change is tested statistically.
Time series measures may be based on larger groups or units that include but are not
restricted to program participants. For example, crime rates for neighborhoods in
which most or all youth participate in a delinquency prevention program might be
used to assess reductions in illegal activity. Evaluation of a series of dropout
prevention activities offered across the school year could examine the percentages of
entering classes that graduate over a period of years. Time series designs should be
considered when it is difficult to identify who receives program services or when the
evaluation budget does not support collection of detailed data from program
participants. Although new statistical techniques have strengthened the statistical
power of these designs,(14) it is still difficult to rule out the potential impact of
non-program events using this approach.
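The logic of a time series comparison can be sketched as follows: fit a straight-line trend to the pre-program observations, project it forward, and measure how far the post-program observations depart from the projection. This is a simplified illustration with invented data and hypothetical function names. It includes no significance test and, as noted above, cannot by itself rule out non-program events.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def average_shift(series, start):
    """series: one outcome value per period, in time order.
    start: index of the first period after the program began.
    Returns the mean gap between observed post-program values and the
    values projected from the pre-program trend."""
    if start < 2 or start >= len(series):
        raise ValueError("need at least two pre-program and one post-program period")
    slope, intercept = fit_line(list(range(start)), series[:start])
    gaps = [series[t] - (intercept + slope * t) for t in range(start, len(series))]
    return sum(gaps) / len(gaps)

# Hypothetical graduation rates (percent) for seven entering classes;
# the program begins with the fifth class (index 4).
graduation_rates = [70, 71, 72, 73, 78, 79, 80]
shift = average_shift(graduation_rates, start=4)
print(f"Average departure from the pre-program trend: {shift:+.1f} points")
```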
The next two designs examine data at the individual level. Cross-sectional
comparisons are based on surveys of groups of participants conducted after program
completion. This design can be used to estimate correlations between outcomes and
differences in the duration, type, and intensity of services received, yielding
conclusions about plausible links between outcomes and services but no definitive
conclusions about what caused what. Panel designs use repeated measures of the
outcome variables for each individual. In this design, outcomes are measured for the
same group of program participants, often starting at the time they enter the program
and continuing at intervals over time. For example, the evaluation of Health Planning
and Promotion: Life Planning Education used pre-post data from participants to
measure gains in understanding the best combinations of contraceptive methods and
the consequences of early childbearing.(15) This design allows the characteristics of
individual participants to be used in the analysis to identify different patterns of
change associated with individual characteristics of participants and control for other
events to which they were exposed.
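For the cross-sectional case, the link between services received and outcomes is typically summarized with a correlation. The sketch below computes a Pearson correlation between hours of service and a post-program score. The data and names are hypothetical, and a strong correlation here shows only a plausible link, not what caused what.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least two values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one variable never varies")
    return sxy / sqrt(sxx * syy)

# Hypothetical survey data: hours of service received and post-program score.
service_hours = [5, 10, 15, 20, 25]
post_scores = [52, 55, 61, 64, 68]

r = pearson_r(service_hours, post_scores)
print(f"Correlation between service hours and outcome: {r:.2f}")
```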
Considerations/Limitations. Several limitations to non-experimental designs should be
noted. First, the cross-sectional and panel designs provide only a segment of the
"dose-response curve," that is, only estimates of the differences in impact related to
differences in the services received. These designs cannot estimate the full impact of
the program compared to no service at all, unless estimates can be based on other
information on the risks of the target population. Second, the designs that track
participants over time (before and after, panel, and time series) cannot control for the
effects of developmental changes that would have occurred without services, or for
the effects of other events outside the program's influence. Third, the extent to which
the results can be assumed to apply to other groups or other settings is limited,
because this design provides no information for assessing the extent to which
participants were selected into the program on the basis of factors which themselves
influence outcomes.
Practical Issues. Non-experimental designs have considerable practical advantages
because they are relatively easy and inexpensive to conduct. Individual data for cross-
sectional or panel analysis are often collected routinely by the program at the end (and
sometimes beginning) of program participation. When relying on program records, the
evaluator needs to review the available data against the logic model to be sure that
adequate information on key variables is already included, or to begin collecting
additional data items if needed.
When individual program records are not available, aggregate statistics may be
obtained from the program or from other community agencies with information on the
outcomes among groups of participants. For example, crime rates, average promotion
rates, and rates of births to teen mothers can be collected from existing records. The
primary problem encountered in using such statistics for assessing impacts is that they
may not be available for the specific population or geographic area targeted by the
program. Often these routinely collected statistics are based on the general population
or geographic areas served by the agency (e.g., the police precinct or the clinic
catchment area). The rates of negative outcomes for the entire set of cases included
may well differ from rates for the targeted group of vulnerable children and youth;
this risk is greater for larger rather than smaller statistical areas.
A more expensive form of data collection for non-experimental evaluations is a
survey of participants some time after the end of the program. These surveys can
provide much needed information on longer-term outcomes such as rates of
employment or earnings or high school graduation. As in any survey research, the
quality of the results is determined by response rate rather than overall sample size,
and by careful attention to the validity and reliability of the questionnaire items.
Example. The Youth Training Scheme (YTS) in Great Britain provides, through local
agents, two years of vocational and on-the-job training for out-of-school and
unemployed youth ages 16 and 17. The local agents are businesses or community
organizations that receive government funds to design a training program, recruit and
supervise youth, and provide at least 13 weeks of on-the-job training per year. Non-
experimental evaluation of YTS was based on a follow-up survey of 63,000 former
participants.(16) In addition to monitoring client satisfaction and job related
outcomes, the survey was used in non-experimental comparisons of differences in
outcomes related to differences among participants: job market outcomes were
compared for graduates versus program dropouts and across youth who entered the
program with different levels of motivation and past school achievement. Results
indicate that program graduates had better labor market outcomes than those who did
not complete the program. Similarly, earning qualifications in the program (an interim
outcome measure) was positively correlated with later labor market success (the
longer term outcome). Non-experimental comparisons were also used to identify
differences in outcomes related to characteristics of the participants or the training
experience. The field of employment and type of local agent providing the training
were significant predictors of labor market outcomes. Similarly, labor market
outcomes were better for youth who began the program with higher levels of
motivation and past school achievement. These findings are suggestive but not
definitive. Because of the non-experimental design, participating youth might have
been more likely to become employed than other youth even in the absence of the program.
CHOOSING AMONG THE IMPACT DESIGNS
Choice of an impact evaluation design begins by identifying the design that both
offers the strongest capacity for isolating the independent causal effects of the
program and is feasible given the structure of the program. The "decision tree" shown in
Exhibit B illustrates a process for identifying which alternatives are feasible.
If the program will be provided to a limited number of youth who can be identified in
advance and randomly selected for participation, then an experimental design should
be considered. If the program will be provided to a limited number of youth, but the
decision about who receives services is determined by organizational or geographic
considerations (or other nonrandom selection rules), then quasi-experimental design
variations should be considered.
The most difficult design challenges occur when the program is intended to serve all
members of the target population. If the new program is implemented fully and
rapidly, no youth will be available for a comparison group. Often, however, new
full-coverage programs--for example, new health services--are intended for an entire
population but not implemented in every community in the country, and certainly not
at the same time. If some communities or groups are not included in the initial
implementation, it may be possible to select as comparison sites communities that
have not implemented the program and use a quasi-experimental design. This may not
solve the problem of comparability sufficiently to allow such a design, however, if the
communities where it was implemented have characteristics that are systematically
different from those where it was not.
When non-experimental designs are necessary, the following can help guide the
choice of design. If a program is implemented at different levels across sites but
uniformly within sites, a cross-sectional design is suitable. If a target population is
exposed to different levels of the program within a community, a panel study design is
better--to follow a sample of individuals, and record both outcomes and the amount of
the program or intervention each individual received and when it occurred. If defining
who is served by the program is difficult or the program is uniformly applied in all
communities, then a time-series design is appropriate. Before-and-after designs
without control groups are often used, but are subject to a number of threats to validity, including maturation and secular changes (discussed above).
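The selection rules in this section can be written out as a short decision function. This sketch encodes only the prose above, not Exhibit B itself; the function and parameter names are hypothetical, and real design choices involve judgments (for example, whether unserved communities are truly comparable) that a yes/no flag cannot capture.

```python
def suggest_design(limited_enrollment, random_assignment_possible,
                   comparable_unserved_sites=False,
                   varies_across_sites=False,
                   varies_within_community=False):
    """Return the impact design suggested by the rules in the text.

    limited_enrollment: the program serves only part of the target population.
    random_assignment_possible: eligible youth can be identified in advance
        and randomly selected for participation.
    comparable_unserved_sites: some comparable communities have not (yet)
        implemented a full-coverage program.
    varies_across_sites: program level differs between sites but is uniform
        within each site.
    varies_within_community: individuals in one community receive different
        levels of the program.
    """
    if limited_enrollment:
        return "experimental" if random_assignment_possible else "quasi-experimental"
    if comparable_unserved_sites:
        return "quasi-experimental"
    # Full coverage with no usable comparison group: non-experimental options.
    if varies_within_community:
        return "panel study"
    if varies_across_sites:
        return "cross-sectional"
    # Uniform application everywhere, or participants cannot be identified.
    return "time series"

print(suggest_design(limited_enrollment=True, random_assignment_possible=False))
print(suggest_design(limited_enrollment=False, random_assignment_possible=False,
                     varies_within_community=True))
```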
Performance Monitoring
Key Elements. Performance monitoring is used to provide information on: 1) key
aspects of how a system or program is operating; 2) whether, and to what extent, pre-
specified program objectives are being attained (e.g., numbers of youth served
compared to target goals, reductions in school dropouts compared to target goals);
and 3) identification of failures to produce program outputs, for use in managing or
redesigning program operations. Performance indicators can also be developed to 4)
monitor service quality by collecting data on the satisfaction of those served, and 5)
report on program efficiency, effectiveness, and productivity by assessing the
relationship between the resources used (program inputs) and the output and outcome
indicators.
If conducted frequently enough and in a timely way, performance monitoring can
provide managers with regular feedback that will allow them to identify problems,
take timely action, and subsequently assess whether their actions have led to the
improvements sought. Performance measures can also stimulate communication about
program goals, progress, obstacles, and results among program staff and managers,
the public, and other stakeholders. They focus attention on the specific outcomes
desired and better ways to achieve them, and can promote credibility by highlighting
the accomplishments and value of the program.
Exhibit B: Process for Selecting Impact Evaluation Designs
Performance monitoring involves identification and collection of specific data on
program outputs, outcomes, and accomplishments. Although they may measure
subjective factors such as client satisfaction, the data are numeric,
consisting of frequency counts, statistical averages, ratios, or percentages. Output
measures reflect internal activities: the amount of work done within the program or
organization. Outcome measures (immediate and longer term) reflect progress
towards program goals. Often the same measurements (e.g., number/percent of youth
who stopped or reduced substance abuse) may be used for performance monitoring
and impact evaluation. However, unlike impact evaluation, performance monitoring
does not make any rigorous effort to determine whether these outcomes were caused by program
efforts or by other external events.
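A typical performance indicator of this kind compares a count against its pre-specified target. The sketch below uses hypothetical indicator names and figures; it reports attainment only and, consistent with the point above, says nothing about what caused the results.

```python
def percent_of_target(actual, target):
    """Attainment of a pre-specified target, expressed as a percentage."""
    if target <= 0:
        raise ValueError("target must be positive")
    return 100.0 * actual / target

# Hypothetical quarterly figures: indicator name -> (actual, target).
indicators = {
    "youth served": (180, 200),
    "workshops held": (24, 20),
}

attainment = {name: percent_of_target(actual, target)
              for name, (actual, target) in indicators.items()}

for name, pct in attainment.items():
    status = "met" if pct >= 100 else "below target"
    print(f"{name}: {pct:.0f}% of target ({status})")
```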
Design Variations. When programs are operating in a number of communities, the
sites are likely to vary in mission, structure, the nature and extent of project
implementation, primary clients/targets, and timeliness. They may offer somewhat
different sets of services, or have identified somewhat different goals. In such
situations, it is advisable to construct a "core" set of performance measures to be used
by all, and to supplement these with "local" performance indicators that reflect
differences. For example, some youth programs will collect detailed data on youth
school performance, including grades, attendance, and disciplinary actions, while
others will simply have data on promotion to the next grade or whether the youth is
still enrolled or has dropped out. A multi-school performance monitoring system
might require data on promotion and enrollment for all schools, and specify more
detailed or specialized indicators on attendance or disciplinary actions for one or a
subset of schools to use in their own performance monitoring.
Considerations/Limitations. In selecting performance indicators, evaluators and
service providers need to consider:
The relevance of potential measures to the mission/objective of the local
program or national initiative. Do process indicators reflect program
strategies/activities identified in mission statements? Do outcome indicators
cover objectives identified in mission statements? Do indicators capture the
priorities at the community level?
The comprehensiveness of the set of measures. Does the set of performance
measures cover inputs, outputs, and service quality as well as outcomes and
include relevant items of customer feedback?
The program's control over the factor being measured. Does the program have
influence/control over the outputs or outcomes measured by the indicator? If
the program has only limited influence over the outputs or outcomes being
measured, the indicator may not fairly reflect program performance.
The validity of the measure. Do the proposed indicators reflect the range of
outcomes the program hopes to affect? Are the data free from obvious reporting
bias?
The reliability and accuracy of the measure. Can indicators be operationally
defined in a straightforward manner so that supporting data can be collected
consistently over time, across data gatherers, and across communities? Do
existing data sources meet these criteria?
The feasibility of collecting the data. How much effort and money is required to
generate each measure? Should a particularly costly measure be retained
because it is perceived as critically important?
Practical Issues. The set of performance indicators should be simple, limited to a few
key indicators of priority outcomes. Too many indicators burden the data collection
and analysis and make it less likely that managers will understand and use reported
information. At the same time, the set of indicators should be constructed to reflect the
informational needs of stakeholders at all levels--community members, agency
directors, and national funders.
Regular measurement, ideally quarterly, is important so that the system provides the
information in time to make shifts in program operations and to capture changes over
time. However, pressures for timely reporting should not be allowed to sacrifice data
quality. For the performance monitoring to take place in a reliable and timely way, the
evaluation should include adequate support and plans for training and technical
assistance for data collection. Routine quality control procedures should be
established to check on data entry accuracy and missing information. At the point of
analysis, procedures for verifying trends should be in place, particularly if the results
are unexpected.
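A routine check for missing information can be as simple as counting, field by field, how many records lack a value. The field names and records below are hypothetical, and a real system would add range and consistency checks for data entry errors.

```python
REQUIRED_FIELDS = ("youth_id", "enrolled", "promoted")  # hypothetical field names

def missing_report(records, required=REQUIRED_FIELDS):
    """Count, for each required field, how many records have no value
    (the field is absent, None, or an empty string)."""
    counts = {field: 0 for field in required}
    for record in records:
        for field in required:
            if record.get(field) in (None, ""):
                counts[field] += 1
    return counts

# Hypothetical program records with some gaps.
records = [
    {"youth_id": "001", "enrolled": True, "promoted": True},
    {"youth_id": "002", "enrolled": True, "promoted": None},
    {"youth_id": "003", "enrolled": "", "promoted": False},
    {"youth_id": "004", "enrolled": True},
]

report = missing_report(records)
for field, count in report.items():
    print(f"{field}: {count} of {len(records)} records missing")
```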
The costs of performance monitoring are modest relative to impact evaluations, but
still vary widely depending on the data used. Most performance indicator data come
from records maintained by service providers. The added expense involves regularly
collecting and analyzing these records, as well as preparing and disseminating reports
to those concerned. This is typically a part-time work assignment for a supervisor
within the agency. The expense will be greater if client satisfaction surveys are used
to measure outcomes. An outside survey organization may be required for a large-
scale survey of past clients; alternatively, a self-administered exit questionnaire can be
given to clients at the end of services. In either case, the assistance of professional
researchers is needed in preparing data sets, analyses, and reports.
Example. The Asociacion Salud con Prevencion (ASCP) in Colombia, South
America, a non-governmental organization that provides primary prevention
services that promote adolescent reproductive health, monitors outputs with data on
the number of professionals trained, the number of youth given educational services,
the number of workshops held, the number of condoms distributed, and the number of
medical and counseling sessions provided. The results demonstrate that the program is
providing promised services, but do not give an indication of the impact in terms of
either immediate outcomes such as use of birth control or longer-term outcomes
(which include reduced risk of out-of-wedlock births or early childbearing).
Process Analysis
Key Element. The key element in process analysis is a systematic, focused plan for
collecting data to: (1) determine whether the program model is being implemented as
specified and, if not, how operations differ from those initially planned; (2) identify
unintended consequences and unanticipated outcomes; and (3) understand the program from the perspectives of staff, participants, and the community.
Design Variations. The systematic procedures used to collect data for process
evaluation often include case studies, focus groups, and ethnography.
Case studies involve the detailed analysis of selected program sites or clients to
determine how the program is operating, what barriers to program implementation
have been encountered, what strategies are the most successful, and what resources
and skills are necessary. The answers to these questions are useful in providing
guidance to policymakers and program planners interested in identifying key program
elements and in generating hypotheses
about program impact that can be tested in impact analyses. Case studies are
sometimes used to test competing hypotheses about differences in the impact of
services. This strategy is used to assess which approach is most successful in attaining
goals shared by all when competing models have emerged in different locations. This
requires purposely selecting sites to represent variations in elements or types of
programs, careful analysis of potential causal models, and the collection of qualitative data to elaborate the causal links at each site.
Clients or sites chosen for case studies should represent wide variation in settings,
program models, and clients. Identification of sample members within sites, interview
topics, and key data elements begins with the logic
model as a guide. In a case study, qualitative data, collected using semi-structured
interviews and observations of program operations, are often supplemented and
verified by quantitative data on program operations and performance collected from
records and reports.
Case studies may use several different approaches for collecting qualitative data for
program evaluation. The most frequently used are semi-structured interviews, focus
groups, and researcher observations while on-site. Semi-structured
interviews allow for the discovery of unanticipated factors associated with
program interpretation and outcomes. Protocols for semi-structured interviews contain
specific questions about particular issues or program
practices. The "semi" aspect of these discussion guides refers to the fact that a
respondent may give as long, detailed, and complex a response as he or she desires to
the question--whatever conveys the full reality of the program's experience with the
issue at hand. If some issues have typical categories associated with them, the
protocols will usually contain probes to make sure the researcher learns about each
category of interest.
In case studies, observations at program sites provide an important method of
validating information from interviews. In this case, the observations will often be
guided by structured or semi-structured protocols designed to ensure that key items
reported in interviews are verified and that consistent procedures for rating program
performance are used across time and across sites.
Focus groups seek to understand attitudes through a series of group discussions
guided by one researcher acting as a facilitator, with another researcher present to take
detailed notes. Five or six general questions are selected to guide open-ended
discussions lasting about an hour and a half. The goals of the discussions may vary
from achieving group consensus to emphasizing points of divergence among
participants. Discussions are tape-recorded, but the primary record is the detailed
notes taken by the researcher who acts as recorder. Less detailed notes may also be
taken publicly, on a flip-chart for all to see, to try to achieve consensus or give group
members the chance to add anything they think is important. Soon after a particular
focus group, the recorder and facilitator summarize in writing the main points that
emerged in response to each of the general questions. When all focus groups are
completed, the researchers develop a combined summary, noting group differences
and suggesting hypotheses about those differences.
Ethnography relies almost exclusively on observation and unstructured interviews to
study:
Organizational and programmatic processes occurring at a program site;
The community context in which the program is taking place;
The relationship between program activities and other activities in the
community;
Causal processes as the participants view them; and
Modes of decision-making.
Ethnography does not begin with the logic model. Its intent is to understand the
program from the perspective of staff, participants, and others in the community.
Ethnographers observe program operations as unobtrusively as possible, sometimes in
the role of participant observer, and keep detailed field notes that are transcribed and
coded to identify emerging themes and trends. The critical research goal is to provide
data on the subjective experience of those in the program situation and to use this
information to understand if the program goals are being achieved and, if so, how.
Ethnography uses procedures that are deliberately flexible. As a result, ethnography is
helpful in gathering information on unintended consequences and unanticipated
outcomes. These unexpected observations may lead to an entirely new concept of
program delivery. In a recent project examining service integration programs for at-
risk youth, observations helped clarify that service integration needed to go beyond
formal links and on-paper agreements, and provided insights into how informal
processes bonded services together in their efforts to make a difference for high-risk
youth in the community.(17) Observations from ethnographic studies are perhaps the
hardest type of qualitative information to analyze, since they generate volumes of
information, much of which may not be directly related to evaluation goals and may
not be comparable across sites.
Practical Issues. Collecting qualitative data requires skilled researchers who are
experienced with the techniques being used. To analyze these data, careful notes must
be taken to ensure that responses are correctly recorded and to aid in interpreting
them. In methods based on interviews, interviewers must be trained to understand the
intent of each question, the possible variety of answers that respondents might give,
and ways to probe to ensure that full information about the issues under investigation
is obtained.
Analysis of qualitative data requires an in-depth understanding of programs,
respondents and responses, and especially the context in which they are evaluated.
Ultimately, the analyst makes judgments regarding the relative importance or
significance of various responses. This requires an unbiased assessment of whether
responses support or refute hypotheses about the way the program works and the
effects it has.
One way to handle qualitative data is to treat one's interview and observational notes
as text, and to conduct a textual analysis using specialized computer software that can
search for the presence of specific themes or content. Qualitative software is available
to facilitate the location and retrieval of information from massive textual files. This
kind of software is expensive to use because huge amounts of text must be entered
into a computer. Further, either the exact words one wants to search for must appear
in the text, or the text must be marked for the presence of any theme or topic that the
researcher wants to retrieve. Often researchers can achieve equal or better results with
carefully constructed interview or data collection guides or structured focus groups,
and systematic recording of responses or coding of data encountered in the field.
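The difference between searching for exact words and retrieving passages marked with a theme can be illustrated with a small sketch. The note structure, theme labels, and passages are all hypothetical and do not represent any particular qualitative software package.

```python
def find_by_keyword(notes, keyword):
    """Passages whose text contains the keyword (case-insensitive)."""
    needle = keyword.lower()
    return [note for note in notes if needle in note["text"].lower()]

def find_by_code(notes, code):
    """Passages a researcher has marked with the given theme code."""
    return [note for note in notes if code in note["codes"]]

# Hypothetical field notes, each marked by the researcher with theme codes.
notes = [
    {"site": "A",
     "text": "Staff said the referral forms take too long to complete.",
     "codes": {"paperwork burden"}},
    {"site": "B",
     "text": "Case managers spend evenings filling out intake sheets.",
     "codes": {"paperwork burden"}},
    {"site": "B",
     "text": "Youth liked meeting mentors at the recreation center.",
     "codes": {"mentor contact"}},
]

by_word = find_by_keyword(notes, "forms")
by_theme = find_by_code(notes, "paperwork burden")
print(f"Keyword search found {len(by_word)} passage(s); "
      f"theme code found {len(by_theme)} passage(s).")
```

The keyword search misses the second passage because it describes the same problem in different words, which is why the text must either contain the exact search terms or be coded by hand.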
Example. Case studies of two pilot projects were used for the evaluation of mentoring
in the juvenile justice system conducted by Public/Private Ventures. The program was
designed to match 100 mentors to at-risk youth. Mentors were trained to meet with
youth one-on-one before and after the youth's release from juvenile detention
facilities, with the goal of establishing an attachment to an adult role model. Data
were collected from mentor logs, program records, court records, structured
interviews with mentors and youth before and after program participation, staff
interviews, focus groups with mentors, youth and service agency staff, and in-depth
interviews with mentor-youth pairs. The qualitative analysis examined the
characteristics of successful matches, issues in program implementation, the style and
content of mentoring interactions, and program staffing. Although it does not offer
evidence on outcomes, the evaluation provides extremely useful information on the
process of implementing a mentoring program and guidance for program development
and replication.
Cost Studies
Key Elements. Cost studies are used to assess investments in programs by collecting
information on: 1) direct program expenditures; 2) the costs of staff and resources
provided by other agencies or diverted from other uses; 3) costs for purchased
services; and 4) the value of donated time and materials. Costs for the first two items
usually include expenditures for staff salaries; fringe benefits; special training costs (if
any); travel; facilities; and supplies and equipment that have to be purchased. The
value of donated resources, which can be substantial, generally has to be estimated
and requires careful documentation of the donation. Cost analyses indicate that
donations are a major cost item in many youth programs. For example, the Cities in
Schools (18) evaluation indicated that donations are between 74 percent and 90
percent of the total direct program costs, and that the wide variation among cities in
the types of donations received made the inclusion of these costs essential to an
understanding of the resources required to sustain program operation.
The typical approach to cost studies is to calculate total program costs and then
derive an average cost per client by dividing the total by either the total number
of clients served or the total number of clients who meet some standardized definition
of success. This type of cost calculation can be linked to results of an experimental or
quasi-experimental impact evaluation to estimate costs per successful client. It can
also be used with performance indicators to assess the cost or cost-efficiency of
achieving program goals.
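The two averages described above can be sketched in a few lines of Python. The function name and all dollar figures and client counts below are hypothetical, chosen only to illustrate the arithmetic; they do not come from any program discussed in this paper.

```python
def average_costs(total_cost, clients_served, successful_clients):
    """Return (cost per client served, cost per successful client)."""
    if clients_served <= 0 or successful_clients <= 0:
        raise ValueError("client counts must be positive")
    return total_cost / clients_served, total_cost / successful_clients


# Hypothetical figures, for illustration only.
per_client, per_success = average_costs(
    total_cost=250_000.0, clients_served=200, successful_clients=80
)
print(f"Cost per client served:     ${per_client:,.2f}")
print(f"Cost per successful client: ${per_success:,.2f}")
```

Because fewer clients succeed than are served, the cost per successful client is always at least as large as the cost per client served, and the gap widens as the success rate falls.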
A second approach to cost estimation calculates the cost per unit of service, for
example, the cost per hour of classroom instruction or the cost per hour of counseling.
This type of cost calculation is then used in impact evaluations (including non-
experimental evaluations) to look at the costs of different outcomes. This type of cost
analysis is difficult in multi-faceted, comprehensive programs in which the level and
type of service are highly variable and may involve a number of service providers. It
is also difficult in programs in which defining exposure to services is difficult. Where
possible, it is preferable to distinguish between fixed costs (e.g., rent or the director's
salary) and variable costs (e.g., the costs of special events or the hourly costs of the
recreation director). The variable costs can then be used to estimate the marginal cost
of adding additional clients to the number receiving a specific unit of service.
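As a rough illustration of the fixed/variable distinction, the sketch below computes an average cost per unit of service and the marginal cost of serving additional clients. It assumes, for simplicity, that variable cost is constant per hour of service; the function names and all figures are hypothetical.

```python
def cost_per_unit(fixed_costs, variable_cost_per_unit, units):
    """Average cost per unit of service (e.g., per counseling hour)."""
    if units <= 0:
        raise ValueError("units must be positive")
    return (fixed_costs + variable_cost_per_unit * units) / units


def marginal_cost(variable_cost_per_unit, extra_clients, units_per_client):
    """Added cost of serving extra clients; fixed costs are unchanged."""
    return variable_cost_per_unit * extra_clients * units_per_client


# Hypothetical program: $60,000 in fixed costs (rent, director's salary),
# $25 in variable cost per counseling hour, 4,000 hours delivered.
avg = cost_per_unit(fixed_costs=60_000, variable_cost_per_unit=25, units=4_000)
added = marginal_cost(variable_cost_per_unit=25, extra_clients=10, units_per_client=20)
print(f"Average cost per counseling hour: ${avg:,.2f}")
print(f"Marginal cost of 10 more clients at 20 hours each: ${added:,.2f}")
```

In this sketch the marginal cost per hour ($25) is well below the average cost per hour, because fixed costs are already covered. This is why the average cost alone can overstate the cost of modest program expansion.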
Design Variations. Cost studies can be undertaken to describe the program costs and
link these to the level of outcomes achieved. In this application, the costs are
compared to the level and type of outcomes documented through performance
monitoring. Decisions on whether the outcomes justify the costs are based on opinions
about the value of the outcomes (not monetized) and the likelihood that the outcomes
are attributable to the program.
Cost-effectiveness analysis is used to compare the costs of different approaches to
providing some standard level of service or desired level of outcome. This approach is
most useful when multiple programs are using different models to provide a service.
The requirements are that the characteristics of target populations served, the program
goals, and the output or outcome measures be identical. For example, cost-
effectiveness studies could compare the relative effectiveness of residential and
nonresidential treatment for drug-abusing youth, provided that the youth served were
similar in age and drug use problems, and that the same measures of treatment success
were used.
Cost-benefit studies provide estimates of the dollar benefits returned for each dollar
spent on the program, the key question from a policy perspective, but one that is not
easily answered. This type of evaluation has rigorous requirements for: 1) an estimate
of program costs, either per client or per unit of service; 2) comparative data on
program impact, that is, an estimate of outcomes with and without the program; and
3) estimates of the value of the benefits. The first item should be obtainable from program
financial records, supplemented as needed by estimates of the cost of donated or
reallocated resources. The second can be obtained from an experimental or quasi-
experimental evaluation of program impact or another strategy for estimating the
difference between what happened and what would have happened without the
program.
The primary barrier to conducting cost-benefit analysis of service programs designed
to change behavior stems from the third item: placing dollar values on benefits. Many
benefits are of intrinsic value (e.g., reductions in family dysfunction and conflict) but
quantifying that value is difficult.
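The three ingredients (program cost, program impact, and the dollar value of each unit of impact) combine into a benefit-cost ratio as in the minimal sketch below. It assumes a single outcome that can be counted and valued at a flat rate; the function name and every number are hypothetical.

```python
def benefit_cost_ratio(total_cost, outcomes_without, outcomes_with, value_per_outcome_averted):
    """Dollars of benefit per dollar spent, for one negative outcome.

    outcomes_without: negative outcomes expected without the program
                      (e.g., from a control or comparison group)
    outcomes_with:    negative outcomes observed among participants
    """
    if total_cost <= 0:
        raise ValueError("total_cost must be positive")
    averted = outcomes_without - outcomes_with
    return averted * value_per_outcome_averted / total_cost


# Hypothetical: 100 participants at $2,000 each; 40 arrests expected
# without the program, 30 observed with it; $15,000 saved per arrest averted.
ratio = benefit_cost_ratio(
    total_cost=100 * 2_000,
    outcomes_without=40,
    outcomes_with=30,
    value_per_outcome_averted=15_000,
)
print(f"Benefit-cost ratio: {ratio:.2f}")
```

A ratio below 1 means the valued benefits fall short of costs under these assumptions. The result is only as credible as the impact estimate and the dollar value assigned to each outcome averted.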
Monetization of benefits to individuals requires assumptions about three matters, all of
which are frequently controversial. First, the dollar value of the benefit may depend
on personal values, for example, what residents are willing to pay for a crime-free
neighborhood. Second, a dollar of benefit today is worth more than a dollar benefit
realized next year. Thus, the benefits need to be time discounted, but by how much is
a difficult question. Third, the beneficiaries need to be identified. Societal values
become important when the beneficiaries differ in standing and perceived merit. For
example, a high school equivalency degree for a violent youthful offender may result
in the same gains in lifetime earnings for the offender as a violence victim would
realize from physical therapy for the injury. Are they to be treated the same? To
circumvent such difficult questions, the analyst may conduct a sensitivity analysis to
reach conclusions based on explicit assumptions of value. For example, the
neighborhood crime prevention program may be deemed cost-effective if "residents
are willing to pay at least $100 per month for 10 percent lower rates of burglary" or "if
the discount rate is less than 6 percent" or "if the offender's earnings are worth 50
percent of the victim's earnings."
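A sensitivity analysis on the discount rate can be sketched as follows. The example assumes a constant annual benefit received at the end of each year and compares its present value with an up-front program cost; the benefit stream, cost, and rates are hypothetical and are not drawn from any study cited here.

```python
def present_value(annual_benefit, years, discount_rate):
    """Present value of a constant benefit received at the end of each year."""
    return sum(
        annual_benefit / (1 + discount_rate) ** t for t in range(1, years + 1)
    )


# Hypothetical: $1,200 in benefits per participant per year for 5 years,
# against an up-front cost of $5,000 per participant.
cost = 5_000
results = {}
for rate in (0.02, 0.04, 0.06, 0.08):
    pv = present_value(annual_benefit=1_200, years=5, discount_rate=rate)
    results[rate] = pv - cost
    verdict = "benefits exceed costs" if pv > cost else "costs exceed benefits"
    print(f"Discount rate {rate:.0%}: present value ${pv:,.0f} ({verdict})")
```

With these illustrative numbers the conclusion flips between a 6 percent and an 8 percent discount rate, which is the kind of explicit, conditional statement a sensitivity analysis is meant to produce.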
Beyond benefits to individuals, the total value of benefits includes the social costs
averted. These are the savings to the public that result from avoiding negative
outcomes. These values must be based on studies that estimate the social costs of
negative outcomes such as the costs of crime or drug abuse.(19) These estimates are
difficult to derive and are often based on tenuous assumptions. To compensate for
problems in the reliability of estimates, cost-benefit calculations normally use a range
of benefits to place an upper and a lower bound on the probable returns to investments
in the program. A more significant problem is that monetary values based on public
costs for the negative outcomes among the general population may be poor estimates
of the value of benefits among the program's target population. For example, national
estimates of the costs of drug abuse may not apply to reductions in amphetamine
abuse among low-income adolescents in a single city. This problem needs to be
acknowledged and value estimates revised to the extent possible to reflect savings for
the program's participants. Other public benefits reflecting gains, not costs averted,
are widely acknowledged, but rarely find their way into cost-benefit studies because
there is no public consensus on their importance. Examples include improvements in
the quality of life or the environment.
Considerations and Limitations. Documentation of gains from prevention programs is
exceptionally difficult and requires estimating negative outcomes that did not occur.
As described above, the most robust estimates of program impacts of this kind are
based on experimental evaluations or quasi-experimental evaluations, which are
difficult and expensive to conduct. When the program has total population coverage, it
is possible to attribute differences between the observed trend and predicted trend in
an outcome indicator over time to program impacts and estimate the monetary value
of the benefits. This strategy was used to estimate the value of drug prevention efforts
in the United States. National survey estimates of drug use in 1979 were used to
estimate expected drug prevalence during the 1980s and early 1990s; the differences
between these estimates and drug use prevalence rates based on national surveys
during these years were attributed to federal investments in drug prevention programs.(20)
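The trend-comparison logic can be illustrated with a short sketch. It takes a predicted series and an observed series for the same years, treats the gap as cases averted, and applies a dollar value per case. The series and the value per case are hypothetical, and the sketch makes the strong assumption that the entire gap is due to the program rather than to other changes over the period.

```python
def value_of_trend_gap(predicted, observed, value_per_case):
    """Return (yearly gaps, total dollar value) for predicted minus observed cases."""
    if len(predicted) != len(observed):
        raise ValueError("series must cover the same years")
    gaps = [p - o for p, o in zip(predicted, observed)]
    return gaps, sum(gaps) * value_per_case


# Hypothetical yearly counts of cases in the covered population.
predicted_cases = [1200, 1180, 1160]   # projected from the pre-program trend
observed_cases = [1100, 1000, 950]     # measured in later surveys
gaps, total_value = value_of_trend_gap(predicted_cases, observed_cases, value_per_case=500)
print("Cases averted by year:", gaps)
print(f"Estimated value of benefits: ${total_value:,}")
```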
Practical Issues. Developing a conceptual framework that reflects all the issues in
cost-benefit valuation, and then devoting the resources necessary for estimating the
range of benefits, can require as much research time and expertise as determining
whether the program had any impact. However, research dollars are always limited
and evaluating program impact is usually the top priority, since valuing benefits is
irrelevant if there is no program impact. A number of studies of the value of
preventing negative outcomes among children and youth have been initiated recently.
These can be expected to give program evaluators substantial help in estimating the
value of reductions in youth problems for use in cost-benefit studies in the future.
Example. An evaluation of 13 delinquency prevention programs in Los Angeles
County estimated cost effectiveness as a function of the delinquency risk of the
population of youth served, costs, and success rate. This study compared cost to
benefit ratios of alternative programs designed with a common goal and outcome
measure: preventing subsequent arrest. Because the risks of delinquency varied among
the youth served by different programs, estimates of the risk of delinquency were
derived from existing research and used to classify the youth served by the program
into four risk categories. Program costs were estimated by taking the total budgets
from all sources divided by the number of clients. Costs of public expenditures for
delinquency (costs to the community and justice system) were estimated from the
proportion of the justice system budget (from the County budget) devoted to juvenile
cases, divided by the number of juvenile cases at various stages of processing (from
annual reports of the Los Angeles Probation Department, the California Youth
Authority, and the U.S. Department of Justice).
The public costs averted were calculated by dividing the budget by the number of
arrests of youth following program participation and calculating the savings as the
difference between the two. The benefits of reductions in expected future arrests were
estimated based on the probability of subsequent arrests reported in studies of criminal
careers times the estimated public savings per arrest averted. Savings to victims were
based on estimates of the costs of damage and loss for each type of juvenile offense
from earlier research, adjusted for inflation. These costs per offense were applied to
the expected lifetime arrests in the absence of the program and benefits were
estimated as the difference between these costs and the absence of costs associated
with no further arrests or victimization (estimating that for each arrest, there are four
to five offenses that do not result in arrest). Thus, estimated program benefits were the
sum of the public costs averted and the savings to victims.
The results were used to estimate the cost differential (costs divided by the value of
benefits) to programs with different rates of success (measured as arrests prevented),
controlling for the risk of offending of the juvenile population served. The findings
were used to estimate the success rate required to show a positive rate of return given
the delinquency risk of the population served for programs with different cost
differentials. This estimate can be used in monitoring the performance of a wide
variety of delinquency prevention programs.
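The idea of a required success rate can be illustrated with a deliberately simplified break-even calculation. This is not the model used in the Los Angeles County study; it assumes that the expected benefit per client equals the client's risk of arrest, times the program's success rate, times a flat dollar saving per arrest averted. All names and figures are hypothetical.

```python
def break_even_success_rate(cost_per_client, arrest_risk, savings_per_arrest_averted):
    """Success rate at which expected benefits per client equal costs per client.

    A result above 1.0 means the program cannot break even for that
    risk group under these assumptions.
    """
    expected_savings_if_fully_successful = arrest_risk * savings_per_arrest_averted
    if expected_savings_if_fully_successful <= 0:
        raise ValueError("risk and savings must be positive")
    return cost_per_client / expected_savings_if_fully_successful


# Hypothetical: $1,500 per client, $10,000 saved per arrest averted,
# and four illustrative risk categories.
required = {
    risk: break_even_success_rate(1_500, risk, 10_000)
    for risk in (0.2, 0.4, 0.6, 0.8)
}
for risk, rate in required.items():
    print(f"Arrest risk {risk:.0%}: required success rate {rate:.1%}")
```

Under these assumptions, programs serving higher-risk youth can break even at lower success rates, because each success averts more expected cost.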
Identifying Potential Evaluation Problems
A number of challenging problems face those who would apply research methods to
the evaluation of human services programs. We summarize these, based on experience
in reviewing and evaluating programs for vulnerable children and youth, to guide
development of realistic evaluation plans. (21)
Defining Program Participation. Programs may be open-ended, lacking both formal
intake procedures and policies for determining when the program is "completed." An
evaluation can only yield interpretable results if participation is explicitly defined and
uniformly measured. In the case of programs for vulnerable youth, for example,
counselors may be contacted for several chats, followed some weeks later by an
appointment, followed by intermittent participation in some, possibly not all, services
offered. Youth may stop attending and then resume. Limiting participation in the
evaluation to those who attend regularly is not an appropriate solution because
dropping from consideration the youth who are most difficult to engage produces
biased results. Often identifying who "participated" and for how long
requires multiple categories to adequately reflect the variations in type, duration, and
intensity of participation among the youth served. In addition, participants should be
followed from the point of first contact and all major program activity documented.
Evaluators also need to decide whether others who potentially benefit from the
program, such as parents, boyfriends/girlfriends, or siblings, are defined as program
participants. If so, their participation in program activities should also be tracked. If
not, plans need to be made on how to count the gains made by these indirect program
beneficiaries in evaluating program impact.
Evaluating the Relationship between Participation and Outcomes. Many programs
emphasize individualized services tailored to need. In the youth services area, youth
with the highest levels of risk are offered the greatest number or most intensive level
of services. Obviously, assignment to treatment in this case is not random, and the
multi-problem youth may never achieve the same level of positive outcomes as youth
who began with fewer problems. For example, studies of the School-Based Health
Centers in the U.S. show that frequent clinic users were at greater risk for alcohol and
substance use, sexual activity, and poor family and peer relationships.(22) Thus,
comparing their outcomes to those for nonusers or those who used the clinic less
frequently would be inappropriate. Similarly, comparisons between different
programs must consider any differences in type and level of risk exhibited by
participants. For this reason, data on the risks and needs of participants should be
collected at intake for use in analysis and a pre-post design used when possible.
Defining the Unit of Analysis. Deciding on the appropriate unit of analysis can be
difficult, particularly in evaluating comprehensive programs. Programs may target
entire neighborhoods, classrooms, or families for change, sometimes planning
activities directly for different groups, and sometimes planning carryover effects.
Measurement at multiple levels is appropriate as long as each level is clearly defined.
For example, crime reduction can be assessed by comparing neighborhood rates of
calls for police services, household victimization rates, or youth delinquency surveys.
Economic gains can be measured by changes in the area unemployment rate, average
household or family income, or individual earnings. The selection should be closely
linked to program goals and activities.
Evaluations of services integration programs, including most that use a case
management approach, will face additional challenges in: 1) tracking the services
received by participants; 2) developing common agreements among agencies on
program goals and required components; 3) documenting service delivery by multiple
agencies; 4) measuring effects of the service delivery system; and 5) differentiating
services integration from service comprehensiveness. Each is discussed briefly below.
Tracking the services received by participants. Services integration usually involves
referring participants to other agencies for needed assistance. A critical, and often
difficult, problem is determining which services were actually received. Clients may
or may not contact agencies to which they are referred, may or may not be accepted
for services, and may or may not participate in services, if accepted. Documenting the
chain of participation is essential to determine the extent to which services integration
is being achieved, but is time consuming and often resisted by programs that see
making the referral as the extent of their responsibility. Because staff turnover in
service agencies is frequently high, preparing written agreements on data access and
sharing is strongly recommended. In the absence of adequate agency documentation,
information on service utilization can be collected in follow-up interviews with
clients.
Developing common agreements among agencies on program goals and required
components. The agencies collaborating in a services integration effort may differ in
their vision of the program's goals, key strategies, and how youth needs will be
evaluated and problems addressed. Evaluations tend to highlight these differences,
which can constitute a barrier in gaining consensus on what is being evaluated. This is
particularly true when multiple agencies recruit clients and/or case management
services are not centralized. Time should be allocated for face-to-face meetings to get
agreement on whom evaluators will count in selecting measures of program outcomes,
and how service provision is expected to achieve program goals.
Documenting service delivery by multiple agencies. When many agencies coordinate
and combine their resources to meet the needs of clients, one of the most difficult
problems is assembling information on who received what types and amounts of
service. Agencies have different methods of identifying clients. In the area of
vulnerable children and youth, some use family identification numbers, others identify
individual children served. Some group service records by family or child; others
maintain records by contact, which introduces multiple records for single clients
which then have to be checked to remove duplication. Agencies such as schools or
juvenile courts can face legal or professional barriers to sharing client-based
information with other agencies or evaluators. A systematic procedure for collecting the
data needed to compile a complete picture of program participation must be developed
early in the planning process and, as noted above, supported by written agreements
and ongoing technical assistance and staff training in record-keeping procedures.
Measuring effects of the service delivery system. A primary goal of services
integration is to change agency operations and increase effectiveness. These outcomes
need to be measured at the agency, not individual, level. Evaluations of services
integration need to document changes in agency procedures, increased participation in
collaborative planning and service delivery, and decreases in barriers to interagency
cooperation and client service associated with policies and procedures. Referral
patterns should show more diversity in planning. At the individual level, clients
should report fewer unmet service needs, shorter waiting periods for service, and
increased satisfaction with the response to their needs. Other evidence of integration
includes increased staff knowledge and familiarity with the resources of other
agencies and community groups.
Differentiating services integration from service comprehensiveness. Services
integration is intended to provide not only faster, more appropriate services, but also
services that would not otherwise be available to certain clients. The referral process
educates clients on the options and assistance potentially available. Improved
interagency planning and coordination reduces the barriers to obtaining additional
services. All this makes the task of differentiating services integration from service
comprehensiveness very difficult. Evaluation and program staff need to develop clear
expectations on the extent to which the ease of obtaining services and the
appropriateness of the service package can be distinguished from the extent to which
the program is providing comprehensive services to meet the full range of client
needs.
Conclusions
Strong pressure to demonstrate program impacts dictates making evaluation activities
a required and intrinsic part of program activities from the start. At the very least,
evaluation activities should include performance monitoring. The collection and
analysis of data on program progress and process builds the capacity for self-
evaluation and contributes to good program management and efforts to obtain support
for program continuation, for example, when the funding is serving as "seed" money
for a program that is intended, if successful, to continue under local sponsorship.
Performance monitoring can be extended to non-experimental evaluation with
additional analysis of program records and/or client surveys. These evaluation
activities may be conducted either by program staff with research training or by an
independent evaluator. In either case, training and technical assistance to support
program evaluation efforts will be needed to maintain data quality and assist in
appropriate analysis and use of the findings.
There are several strong arguments for evaluation designs that go further in
documenting program impact. Only experimental or quasi-experimental designs
provide convincing evidence that program funds are well invested, and that the
program is making a real difference to the well-being of the population served. These
evaluations need to be conducted by experienced researchers and supported by
adequate budgets. A good strategy may be implementing small-scale programs to test
alternative models of service delivery in settings that will allow a stronger impact
evaluation design than is possible in a large-scale national program. Often program
evaluation should proceed in stages. The first year of program operations can be
devoted to process studies and performance monitoring, the information from which
can serve as a basis for more extensive evaluation efforts once operations are running
smoothly.
Finally, planning to obtain support for the evaluation at every level--community,
program staff, agency leadership and funder--should be extensive. Each of these has a
stake in the results. Each should have a voice in planning. And each should perceive
clear benefits from the results. Only in this way will the results be acknowledged as
valid and actually used for program improvement.
Notes
1. Connell, J.P., Kubisch, A.C., Schorr, L.B., and Weiss, C.H. (1995) New Approaches to Evaluating Community
Initiatives: Concepts, Methods, and Contexts. Washington, DC: The Aspen Institute.
2. Kumpfer, K.L., Shur, G.H., Ross, J.H., Bunnell, K.K., Librett, J.J. and Milward, A.R. (1993) Measurements in
Prevention: A Manual on Selecting and Using Instruments to Evaluate Prevention Programs. Public Health Service,
U.S. Department of Health and Human Services, (SMA) 93-2041.
3. For more information on deciding whether and how to conduct a program
evaluation, see Schmidt, R.E., J.B. Bell, and J.W. Scanlon (1979), "Evaluability Assessment: Making Public
Programs Work Better," Human Services Monograph Series, 14: 4-5. Washington, DC; and Wholey, Joseph
S. (1994), "Assessing the Feasibility and Likely Usefulness of Evaluation." In Joseph S. Wholey, Harry P. Hatry,
and Katherine E. Newcomer (eds.), Handbook of Practical Evaluation, 15-39. San Francisco: Jossey-Bass.
4. Berk, R.A., and Sherman, L.W. (1988) "Police Responses to Family Violence Incidents: An Analysis of an
Experimental Design with Incomplete Randomization." Journal of the American Statistical Association 83(401): 70-76.
5. Kalbfleisch, J.D., and Prentice, R.L. (1980) The Statistical Analysis of Failure Time Data. New York: Wiley.
6. Rhodes, W.M. (1986) "A Survival Model with Dependent Competing Events and Right-hand Censoring:
Probation and Parole as an Illustration." Journal of Quantitative Criminology 2(2): 113-138.
7. Ellickson, P.L., Bell, R.M., and McGuigan, K. (1993) "Preventing Adolescent Drug Use: Long-Term Results of a
Junior High School Program." American Journal of Public Health 83(6): 856-861.
8. See Campbell, D.T. and Stanley, J.C. (1963) Experimental and Quasi-experimental Designs for
Research. Chicago: Rand McNally.
9. Campbell and Stanley (1963).
10. Loftin, C., McDowall, D., Wiersma, B., and Cottey, T.J. (1991) "Effects of Restrictive Licensing of Handguns
on Homicide and Suicide in the District of Columbia." New England Journal of Medicine 325 (December 5): 1615-1620.
11. Heckman, J.J. (1979) "Sample Selection Bias as a Specification Error." Econometrica 47:153-162.
12. Joreskog, K.G. (1977) "Structural Equation Models in the Social Sciences." In P.R. Krishnaiah
(ed.), Applications of Statistics, 265-287. Amsterdam: North-Holland; Bryk, A.S. and Raudenbush, S.W. (1992)
Hierarchical Linear Models: Applications and Meta-Analysis Techniques. Newbury Park, CA: Sage.
13. Roosa, M.W. and Vaughan, L. (1983) "Teen Mothers Enrolled in an Alternative Parenting Program: A
Comparison with Their Peers." Urban Education 18: 348-360.
14. Engle, R.F. and Granger, C.W.J. (1987) "Cointegration and Error Correction: Representation, Estimation and
Testing." Econometrica 55: 251-276.
15. Barker, G. and Fontes, M. (1995) "Review and Analysis of International Experience with Programs Targeted at
At-Risk Youth." Paper prepared for the World Bank.
16. Barker and Fontes (1995).
17. Chaiken, M. (1990) "Evaluation of Girls Clubs of America's Friendly PEERsuasion Program." In R.R. Watson
(ed.), Drug and Alcohol Abuse Prevention, 265-287. Clifton, NJ: Humana Press.
18. Rossman, S.B. and Morley, E. (1994) The National Evaluation of Cities in Schools. Report submitted to the
Office of Juvenile Justice and Delinquency Prevention. Washington, DC: The Urban Institute.
19. Cohen, M. (1994) "The Monetary Value of Saving a High Risk Youth." Draft report. Washington, DC: The
Urban Institute.
20. Kim, S., Coletti, S.D., Crutchfield, C.C., Williams, C. and Hepler, N. (1995) "Benefit-Cost Analysis of Drug
Abuse Prevention Programs: A Macroscopic Approach." Journal of Drug Education 25(2): 111-127.
21. Burt, M.R. and Resnick, G. (1992) Youth at Risk: Evaluation Issues. Washington, DC: The Urban Institute.
22. Barker and Fontes (1995).