Evaluation Design
This case study is based on “Pitfalls of Participatory Programs: Evidence from a Randomized Evaluation in India,” by Abhijit Banerjee (MIT), Rukmini Banerjee (Pratham), Esther Duflo (MIT), Rachel Glennerster (J-PAL), and Stuti Khemani (The World Bank).
J-PAL thanks the authors for allowing us to use their paper.
Case 2: Learn to Read Evaluations
Evaluating the Read India Campaign: How to Read and Evaluate Evaluations
J-PAL Executive Education Course Case Study 2: Learn to Read Evaluations
Key Vocabulary
Why Learn to Read (L2R)?
In a large-scale survey conducted in 2004, Pratham discovered that only 39% of children (aged 7-14) in rural Uttar Pradesh could read and understand a simple story, and nearly 15% could not recognize even a letter.
During this period, Pratham was developing the “Learn-to-Read” (L2R) module of its Read India campaign. L2R included a unique pedagogy teaching basic literacy skills, combined with a grassroots organizing effort to recruit volunteers willing to teach.
This program allowed the community to get involved in children’s education more directly through village meetings where Pratham staff shared information on the status of literacy in the village and the rights of children to education. In these meetings, Pratham identified community members who were willing to teach. Volunteers attended a training session on the pedagogy, after which they could hold after-school reading classes for children, using materials designed and provided by Pratham. Pratham staff paid occasional visits to these camps to ensure that the classes were being held and to provide additional training as necessary.
Did the Learn to Read project work?
Did Pratham’s “Learn to Read” program work? What is required in order for us to measure whether a program worked, or whether it had impact?
1. Counterfactual: what would have happened to the participants in a program had they not received the intervention. The counterfactual cannot be observed from the treatment group; it can only be inferred from the comparison group.
2. Comparison Group: in an experimental design, a randomly assigned group from the same population that does not receive the intervention that is the subject of evaluation. Participants in the comparison group are used as a standard for comparison against the treated subjects in order to validate the results of the intervention.
3. Program Impact: estimated by measuring the difference in outcomes between comparison and treatment groups. The true impact of the program is the difference in outcomes between the treatment group and its counterfactual.
4. Baseline: data describing the characteristics of participants, measured across both treatment and comparison groups prior to implementation of the intervention.
5. Endline: data describing the characteristics of participants, measured across both treatment and comparison groups after implementation of the intervention.
6. Selection Bias: statistical bias between comparison and treatment groups in which individuals in one group are systematically different from those in the other. This can occur when the treatment and comparison groups are chosen in a non-random fashion, so that they differ from each other in one or more factors that may affect the outcome of the study.
7. Omitted Variable Bias: statistical bias that occurs when certain variables or characteristics (often unobservable) that affect the measured outcome are omitted from a regression analysis. Because they are not included as controls in the regression, one incorrectly attributes the measured impact solely to the program.
In general, to ask if a program works is to ask if the program achieves its goal of changing certain outcomes for its participants, and to ensure that those changes are not caused by other factors or events happening at the same time. To show that the program causes the observed changes, we must also show that, had the program not been implemented, the observed changes would not have occurred (or would have been different). But how do we know what would have happened? If the program happened, it happened. Measuring what would have happened requires entering an imaginary world in which the program was never given to these participants. The outcomes of the same participants in this imaginary world are referred to as the counterfactual. Since we cannot observe the true counterfactual, the best we can do is to estimate it by mimicking it.
The key challenge of program impact evaluation is constructing or mimicking the counterfactual. We typically do this by selecting a group of people that resemble the participants as much as possible but who did not participate in the program. This group is called the comparison group. Because we want to be able to say that it was the program and not some other factor that caused the changes in outcomes, it is important that the only difference between the comparison group and the participants is that the comparison group did not participate in the program. We then estimate “impact” as the difference observed at the end of the program between the outcomes of the comparison group and the outcomes of the program participants.
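The estimate described above can be written out as a short decomposition. The notation is an assumption introduced here, not part of the case: Y(1) and Y(0) denote a child's outcome with and without the program, and D = 1 marks participants. Under that notation, the observed difference splits into the impact on participants plus a term that is zero only when the comparison group mimics the counterfactual.

```latex
\begin{align*}
\underbrace{E[Y(1) \mid D=1] - E[Y(0) \mid D=0]}_{\text{observed difference}}
  &= \underbrace{E[Y(1) \mid D=1] - E[Y(0) \mid D=1]}_{\text{impact on participants}} \\
  &\quad + \underbrace{E[Y(0) \mid D=1] - E[Y(0) \mid D=0]}_{\text{selection bias}}
\end{align*}
```

The middle quantity, E[Y(0) | D=1], is the counterfactual: it is never observed, which is why the second term cannot be checked directly from the data.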
The impact estimate is only as accurate as the comparison group is successful at mimicking the counterfactual. If the comparison group poorly represents the counterfactual, the impact is (in most circumstances) poorly estimated. Therefore the method used to select the comparison group is a key decision in the design of any impact evaluation.
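A small simulation can illustrate why the choice of comparison group matters. This is a sketch under invented assumptions, not an analysis of the Read India data: the function name `simulate`, the ability distribution, and the effect size of 0.5 are all hypothetical. It compares the same simple difference under random assignment and under enrollment that targets weaker readers.

```python
import random
from statistics import mean

def simulate(n=20000, true_effect=0.5, seed=0):
    """Return the simple difference under random and under targeted enrollment.

    All numbers are hypothetical; nothing here comes from the Read India data.
    """
    rng = random.Random(seed)
    # Each child's outcome without the program equals a latent ability.
    ability = [rng.gauss(2.0, 1.0) for _ in range(n)]

    def simple_difference(treated_flags):
        # Treated children gain true_effect; comparison children do not.
        treated = [a + true_effect for a, t in zip(ability, treated_flags) if t]
        comparison = [a for a, t in zip(ability, treated_flags) if not t]
        return mean(treated) - mean(comparison)

    # Random assignment: a coin flip decides who receives the program.
    random_flags = [rng.random() < 0.5 for _ in range(n)]
    # Targeted enrollment (assumed rule): only weaker readers join.
    targeted_flags = [a < 2.0 for a in ability]

    return simple_difference(random_flags), simple_difference(targeted_flags)

randomized_estimate, targeted_estimate = simulate()
print(f"randomized comparison group: {randomized_estimate:.2f}")
print(f"targeted enrollment:         {targeted_estimate:.2f}")
```

In this setup the randomized estimate lands close to the assumed effect, while the targeted estimate is negative even though every treated child benefits, because the comparison group started out ahead.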
That brings us back to our questions: Did the Learn to Read project work? What was its impact on children’s reading levels?
In this case, the intention of the program is to “improve children’s reading levels” and the reading level is the outcome measure. So, when we ask if the Learn to Read project worked, we are asking if it improved children’s reading levels. The impact is the difference between reading levels after the children have taken the reading classes and what their reading level would have been if the reading classes had never existed.
For reference, Reading Level is an indicator variable that takes value 0 if the child can read nothing, 1 if he knows the alphabet, 2 if he can recognize words, 3 if he can read a paragraph, and 4 if he can read a full story.
What comparison groups can we use? The following experts illustrate different methods of evaluating impact. (Refer to the table on the last page of the case for a list of different evaluation methods).
Estimating the impact of the Learn to Read project
Method 1:
News Release: Read India helps children Learn to Read. Pratham celebrates the success of its “Learn to Read” program—part of the Read India Initiative. It has made significant progress in its goal of improving children’s literacy rates through better learning materials, pedagogical methods, and most importantly, committed volunteers. The achievement of the “Learn to Read” (L2R) program demonstrates that a revised curriculum, galvanized by community mobilization, can produce significant gains. Massive government expenditures in mid-day meals and school construction have failed to achieve similar results. In less than a year, the reading levels of children who enrolled in the L2R camps improved considerably.
[Figure: two charts, “Distribution of Endline Scores for Baseline Non-Readers (Zero)” and “Distribution of Endline Scores for Baseline Letter Readers.” Vertical axis: percent (0 to 100); horizontal axis: endline reading level, with Word, Paragraph, and Story among the categories shown.]
Just before the program started, half these children could not recognize Hindi words—many nothing at all. But after spending just a few months in Pratham reading classes, more than half improved by at least one reading level, with a significant number capable of recognizing words and several able to read full paragraphs and stories! On average, the literacy measure of these students improved by nearly one full reading level during this period.
Discussion Topic 1:
1. What type of evaluation does this news release imply?
2. What represents the counterfactual?
3. What are the problems with this type of evaluation?
Method 2:
Opinion: The “Read India” project not up to the mark
Pratham has raised millions of dollars, expanding rapidly to cover all of India with its so-called “Learn-to-Read” program, but do its students actually learn to read? Recent evidence suggests otherwise. A team of evaluators from Education for All found that children who took the reading classes ended up with literacy levels significantly below those of their village counterparts. After one year of Pratham reading classes, Pratham students could only recognize words whereas those who steered clear of Pratham programs were able to read full paragraphs.
[Figure: Comparison of reading levels of children who took reading classes vs. reading levels of children who did not take them. Bars: “Did not take reading classes” and “Took reading classes.” Vertical axis: reading level (0 to 3).]
Notes: Reading Level is an indicator variable that takes value 0 if the child can read nothing, 1 if he knows the alphabet, 2 if he can recognize words, 3 if he can read a paragraph, and 4 if he can read a full story.
If you have a dime to spare, and want to contribute to the education of India’s illiterate children, you may think twice before throwing it into the fountain of Pratham’s promises.
Discussion Topic 2:
1. What type of evaluation is this opinion piece employing?
2. What represents the counterfactual?
3. What are the problems with this type of evaluation?
Method 3:
Letter to the Editor: EFA should consider Evaluating Fairly and Accurately
There have been several unfair reports in the press concerning programs implemented by the NGO Pratham. A recent article by a former Education for All bureaucrat claims that Pratham is actually hurting the children it recruits into its ‘Learn-to-Read’ camps. However, the EFA analysis uses the wrong metric to measure impact. It compares the reading levels of Pratham students with those of other children in the village, without taking into account the fact that Pratham targets those whose literacy levels are particularly poor at the beginning. If Pratham simply recruited the most literate children into their programs, and compared them to their poorer counterparts, they could claim success without conducting a single class. But Pratham does not do this. And realistically, Pratham does not expect its illiterate children to overtake the stronger students in the village. It simply tries to initiate improvement over the current state. Therefore the metric should be improvement in reading levels, not the final level. When we repeated EFA’s analysis using the more appropriate outcome measure, the Pratham kids improved at twice the rate of the non-Pratham kids (0.6 reading level increase compared to 0.3). This difference is statistically very significant.
Had the EFA evaluators thought to look at the more appropriate outcome, they would recognize the incredible success of Read India. Perhaps they should enroll in some Pratham classes themselves.
Discussion Topic 3:
1. What type of evaluation is this letter using?
2. What represents the counterfactual?
3. What are the problems with this type of evaluation?
Methodology: Pre-Post
Description: Measure how program participants improved (or changed) over time.
Who is in the comparison group? Program participants themselves, before participating in the program.
Required Assumptions: The program was the only factor influencing any changes in the measured outcome over time.
Required Data: Before and after data for program participants.

Methodology: Simple Difference
Description: Measure difference between program participants and non-participants after the program is completed.
Who is in the comparison group? Individuals who didn’t participate in the program (for any reason), but for whom data were collected after the program.
Required Assumptions: Non-participants are identical to participants except for program participation, and were equally likely to enter the program before it started.
Required Data: After data for program participants and non-participants.

Methodology: Differences in Differences
Description: Measure improvement (change) over time of program participants relative to the improvement (change) of non-participants.
Who is in the comparison group? Individuals who didn’t participate in the program (for any reason), but for whom data were collected both before and after the program.
Required Assumptions: If the program didn’t exist, the two groups would have had identical trajectories over this period.
Required Data: Before and after data for both participants and non-participants.

Methodology: Multivariate Regression
Description: Individuals who received treatment are compared with those who did not, and other factors that might explain differences in the outcomes are “controlled” for.
Who is in the comparison group? Individuals who didn’t participate in the program (for any reason), but for whom data were collected both before and after the program. In this case the data comprise not just indicators of outcomes but other “explanatory” variables as well.
Required Assumptions: The factors that were excluded (because they are unobservable and/or have not been measured) do not bias results because they are either uncorrelated with the outcome or do not differ between participants and non-participants.
Required Data: Outcomes as well as “control variables” for both participants and non-participants.

Methodology: Statistical Matching
Description: Individuals in control group are compared to similar individuals in experimental group.
Who is in the comparison group? Exact matching: for each participant, at least one non-participant who is identical on selected characteristics. Propensity score matching: non-participants who have a mix of characteristics which predict that they would be as likely to participate as participants.
Required Assumptions: The factors that were excluded (because they are unobservable and/or have not been measured) do not bias results because they are either uncorrelated with the outcome or do not differ between participants and non-participants.
Required Data: Outcomes as well as “variables for matching” for both participants and non-participants.

Methodology: Regression Discontinuity Design
Description: Individuals are ranked based on specific, measurable criteria. There is some cutoff that determines whether an individual is eligible to participate. Participants are then compared to non-participants and the eligibility criterion is controlled for.
Who is in the comparison group? Individuals who are close to the cutoff, but fall on the “wrong” side of that cutoff, and therefore do not get the program.
Required Assumptions: After controlling for the criteria (and other measures of choice), the remaining differences between individuals directly below and directly above the cut-off score are not statistically significant and will not bias the results. A necessary but not sufficient requirement for this to hold is that the cut-off criteria are strictly adhered to.
Required Data: Outcomes as well as measures on criteria (and any other controls).

Methodology: Instrumental Variables
Description: Participation can be predicted by an incidental (almost random) factor, or “instrumental” variable, that is uncorrelated with the outcome, other than the fact that it predicts participation (and participation affects the outcome).
Who is in the comparison group? Individuals who, because of this close to random factor, are predicted not to participate and (possibly as a result) did not participate.
Required Assumptions: If it weren’t for the instrumental variable’s ability to predict participation, this “instrument” would otherwise have no effect on or be uncorrelated with the outcome.
Required Data: Outcomes, the “instrument,” and other control variables.

Methodology: Randomized Evaluation (Experimental Method)
Description: Experimental method for measuring a causal relationship between two variables.
Who is in the comparison group? Participants are randomly assigned to the control groups.
Required Assumptions: Randomization “worked.” That is, the two groups are statistically identical (on observed and unobserved factors).
Required Data: Outcome data for control and experimental groups. Control variables can help absorb variance and improve “power”.
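The first three methods in the table differ only in which means are subtracted from which. The sketch below works through that arithmetic in Python on a handful of invented records. The `children` list and the helper names are hypothetical, and the values are not taken from the study.

```python
from statistics import mean

# Hypothetical records: (took_classes, baseline_level, endline_level).
# Levels use the 0-4 reading scale; the values are invented for illustration.
children = [
    (True, 0, 1), (True, 1, 2), (True, 1, 1), (True, 2, 3),
    (False, 2, 2), (False, 3, 3), (False, 2, 3), (False, 3, 4),
]

def group(took_classes):
    """Select the records for participants (True) or non-participants (False)."""
    return [c for c in children if c[0] == took_classes]

def mean_baseline(rows):
    return mean(r[1] for r in rows)

def mean_endline(rows):
    return mean(r[2] for r in rows)

participants, non_participants = group(True), group(False)

# Pre-post: change over time among participants only.
pre_post = mean_endline(participants) - mean_baseline(participants)

# Simple difference: participants vs. non-participants at endline.
simple_difference = mean_endline(participants) - mean_endline(non_participants)

# Differences in differences: participants' change minus non-participants' change.
non_participant_change = (
    mean_endline(non_participants) - mean_baseline(non_participants)
)
diff_in_diff = pre_post - non_participant_change

print(f"pre-post:                   {pre_post:+.2f}")           # +0.75
print(f"simple difference:          {simple_difference:+.2f}")  # -1.25
print(f"differences in differences: {diff_in_diff:+.2f}")       # +0.25
```

On these invented numbers the three methods give three different answers, and the simple difference even has the opposite sign, because each relies on a different assumption from the table. By the same arithmetic, the changes of 0.6 and 0.3 reading levels reported in the letter under Method 3 correspond to a difference of 0.3.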
MIT OpenCourseWare http://ocw.mit.edu
Resource: Abdul Latif Jameel Poverty Action Lab Executive Training: Evaluating Social Programs Dr. Rachel Glennerster, Prof. Abhijit Banerjee, Prof. Esther Duflo
The following may not correspond to a particular course on MIT OpenCourseWare, but has been provided by the author as an individual learning resource.
For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.