Evaluation Design


This case study is based on “Pitfalls of Participatory Programs: Evidence from a Randomized Evaluation in India,” by Abhijit Banerjee (MIT), Rukmini Banerjee (Pratham), Esther Duflo (MIT), Rachel Glennerster (J-PAL), and Stuti Khemani (The World Bank).

J-PAL thanks the authors for allowing us to use their paper.

Case 2: Learn to Read Evaluations

Evaluating the Read India Campaign: How to Read and Evaluate Evaluations

J-PAL Executive Education Course Case Study 2: Learn to Read Evaluations

Key Vocabulary

Why Learn to Read (L2R)?

In a large-scale survey conducted in 2004, Pratham discovered that only 39% of children (aged 7-14) in rural Uttar Pradesh could read and understand a simple story, and nearly 15% could not recognize even a letter.

During this period, Pratham was developing the “Learn-to-Read” (L2R) module of its Read India campaign. L2R included a unique pedagogy teaching basic literacy skills, combined with a grassroots organizing effort to recruit volunteers willing to teach.

This program allowed the community to get involved in children’s education more directly through village meetings where Pratham staff shared information on the status of literacy in the village and the rights of children to education. In these meetings, Pratham identified community members who were willing to teach. Volunteers attended a training session on the pedagogy, after which they could hold after-school reading classes for children, using materials designed and provided by Pratham. Pratham staff paid occasional visits to these camps to ensure that the classes were being held and to provide additional training as necessary.

Did the Learn to Read project work?

Did Pratham’s “Learn to Read” program work? What is required in order for us to measure whether a program worked, or whether it had impact?

1. Counterfactual: what would have happened to the participants in a program had they not received the intervention. The counterfactual cannot be observed from the treatment group; it can only be inferred from the comparison group.

2. Comparison Group: in an experimental design, a randomly assigned group from the same population that does not receive the intervention that is the subject of evaluation. Participants in the comparison group are used as a standard for comparison against the treated subjects in order to validate the results of the intervention.

3. Program Impact: estimated by measuring the difference in outcomes between comparison and treatment groups. The true impact of the program is the difference in outcomes between the treatment group and its counterfactual.

4. Baseline: data describing the characteristics of participants, measured across both treatment and comparison groups prior to implementation of the intervention.

5. Endline: data describing the characteristics of participants, measured across both treatment and comparison groups after implementation of the intervention.

6. Selection Bias: statistical bias between comparison and treatment groups in which individuals in one group are systematically different from those in the other. This can occur when the treatment and comparison groups are chosen in a non-random fashion, so that they differ from each other in one or more factors that may affect the outcome of the study.

7. Omitted Variable Bias: statistical bias that occurs when certain variables or characteristics (often unobservable) that affect the measured outcome are omitted from a regression analysis. Because they are not included as controls in the regression, one incorrectly attributes the measured impact solely to the program.


In general, to ask if a program works is to ask if the program achieves its goal of changing certain outcomes for its participants, and ensure that those changes are not caused by some other factors or events happening at the same time. To show that the program causes the observed changes, we need to simultaneously show that if the program had not been implemented, the observed changes would not have occurred (or would be different). But how do we know what would have happened? If the program happened, it happened. Measuring what would have happened requires entering an imaginary world in which the program was never given to these participants. The outcomes of the same participants in this imaginary world are referred to as the counterfactual. Since we cannot observe the true counterfactual, the best we can do is to estimate it by mimicking it.

The key challenge of program impact evaluation is constructing or mimicking the counterfactual. We typically do this by selecting a group of people that resemble the participants as much as possible but who did not participate in the program. This group is called the comparison group. Because we want to be able to say that it was the program and not some other factor that caused the changes in outcomes, it is important that the only difference between the comparison group and the participants is that the comparison group did not participate in the program. We then estimate “impact” as the difference observed at the end of the program between the outcomes of the comparison group and the outcomes of the program participants.

The impact estimate is only as accurate as the comparison group is successful at mimicking the counterfactual. If the comparison group poorly represents the counterfactual, the impact is (in most circumstances) poorly estimated. Therefore the method used to select the comparison group is a key decision in the design of any impact evaluation.
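To make this point concrete, here is a minimal simulation sketch in Python. Everything in it is hypothetical and is not data from the study: the ability scores, the true effect of 0.5, and the enrollment rule (children with below-average ability enroll) are assumptions chosen only to show how the choice of comparison group can change the estimate.

```python
import random


def difference_in_means(treated, comparison):
    """Estimated impact: mean treated outcome minus mean comparison outcome."""
    return sum(treated) / len(treated) - sum(comparison) / len(comparison)


def simulate(n_children=20000, true_effect=0.5, seed=0):
    """Compare a self-selected comparison group with a randomly assigned one.

    All numbers are hypothetical. Each simulated child has an unobserved
    "ability"; the outcome is ability plus true_effect if the child is treated.
    """
    rng = random.Random(seed)
    targeted = {True: [], False: []}    # program enrolls weaker children
    randomized = {True: [], False: []}  # program assigned by coin flip
    for _ in range(n_children):
        ability = rng.gauss(0.0, 1.0)

        # Scenario 1: only children with below-average ability enroll.
        in_program = ability < 0.0
        targeted[in_program].append(ability + true_effect * in_program)

        # Scenario 2: a coin flip decides who gets the program.
        assigned = rng.random() < 0.5
        randomized[assigned].append(ability + true_effect * assigned)

    targeted_estimate = difference_in_means(targeted[True], targeted[False])
    randomized_estimate = difference_in_means(randomized[True], randomized[False])
    return targeted_estimate, randomized_estimate


targeted_estimate, randomized_estimate = simulate()
print("Non-random comparison group:", round(targeted_estimate, 2))
print("Randomly assigned comparison group:", round(randomized_estimate, 2))
```

Under these assumptions, the non-random comparison yields a negative estimate even though the simulated program helps every child who receives it, while the randomly assigned comparison lands close to the assumed true effect.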

That brings us back to our questions: Did the Learn to Read project work? What was its impact on children’s reading levels?

In this case, the intention of the program is to “improve children’s reading levels” and the reading level is the outcome measure. So, when we ask if the Learn to Read project worked, we are asking if it improved children’s reading levels. The impact is the difference between reading levels after the children have taken the reading classes and what their reading level would have been if the reading classes had never existed.

For reference, Reading Level is an ordinal variable that takes the value 0 if the child can read nothing, 1 if the child knows the alphabet, 2 if the child can recognize words, 3 if the child can read a paragraph, and 4 if the child can read a full story.
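As an illustration of how this coding turns test results into a number that can be averaged, the sketch below maps each category to its code and computes a mean reading level. The category labels and the five example children are hypothetical; only the 0 to 4 coding comes from the case.

```python
# Reading-level coding as defined in the case (0 to 4).
# The string labels are illustrative names, not taken from the study's data files.
READING_LEVELS = {
    "nothing": 0,    # can read nothing
    "letter": 1,     # knows the alphabet
    "word": 2,       # can recognize words
    "paragraph": 3,  # can read a paragraph
    "story": 4,      # can read a full story
}


def mean_reading_level(labels):
    """Average the coded reading levels for a list of category labels."""
    codes = [READING_LEVELS[label] for label in labels]
    return sum(codes) / len(codes)


# Hypothetical test results for five children (not data from the study).
hypothetical_children = ["nothing", "letter", "letter", "word", "story"]
print(mean_reading_level(hypothetical_children))  # (0 + 1 + 1 + 2 + 4) / 5 = 1.6
```

Averaging treats the gap between any two adjacent levels as equal, which is a simplifying assumption whenever a mean reading level is reported.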

What comparison groups can we use? The following experts illustrate different methods of evaluating impact. (Refer to the table on the last page of the case for a list of different evaluation methods).

Estimating the impact of the Learn to Read project

Method 1:

News Release: Read India helps children Learn to Read.

Pratham celebrates the success of its “Learn to Read” program, part of the Read India Initiative. It has made significant progress in its goal of improving children’s literacy rates through better learning materials, pedagogical methods, and most importantly, committed volunteers. The achievement of the “Learn to Read” (L2R) program demonstrates that a revised curriculum, galvanized by community mobilization, can produce significant gains. Massive government expenditures in mid-day meals and school construction have failed to achieve similar results. In less than a year, the reading levels of children who enrolled in the L2R camps improved considerably.


[Figure: two bar charts. Panel 1: Distribution of Endline Scores for Baseline Non-Readers (Zero). Panel 2: Distribution of Endline Scores for Baseline Letter Readers. Vertical axis: 0 to 100 (percent); horizontal axis: endline reading level, including Word, Paragraph, and Story.]


Just before the program started, half these children could not recognize Hindi words—many nothing at all. But after spending just a few months in Pratham reading classes, more than half improved by at least one reading level, with a significant number capable of recognizing words and several able to read full paragraphs and stories! On average, the literacy measure of these students improved by nearly one full reading level during this period.

Discussion Topic 1:

1. What type of evaluation does this news release imply?

2. What represents the counterfactual?

3. What are the problems with this type of evaluation?


Method 2:

Opinion: The “Read India” project not up to the mark

Pratham has raised millions of dollars, expanding rapidly to cover all of India with its so-called “Learn-to-Read” program, but do its students actually learn to read? Recent evidence suggests otherwise. A team of evaluators from Education for All found that children who took the reading classes ended up with literacy levels significantly below those of their village counterparts. After one year of Pratham reading classes, Pratham students could only recognize words whereas those who steered clear of Pratham programs were able to read full paragraphs.

[Figure: Comparison of reading levels of children who took reading classes vs. reading levels of children who did not take them. Bar chart of mean reading level (vertical axis, 0 to 3) for two groups: did not take reading classes, and took reading classes.]

Notes: Reading Level is an ordinal variable that takes the value 0 if the child can read nothing, 1 if the child knows the alphabet, 2 if the child can recognize words, 3 if the child can read a paragraph, and 4 if the child can read a full story.

If you have a dime to spare, and want to contribute to the education of India’s illiterate children, you may think twice before throwing it into the fountain of Pratham’s promises.

Discussion Topic 2:

1. What type of evaluation is this opinion piece employing?

2. What represents the counterfactual?

3. What are the problems with this type of evaluation?



Method 3:

Letter to the Editor: EFA should consider Evaluating Fairly and Accurately

There have been several unfair reports in the press concerning programs implemented by the NGO Pratham. A recent article by a former Education for All bureaucrat claims that Pratham is actually hurting the children it recruits into its ‘Learn-to-Read’ camps. However, the EFA analysis uses the wrong metric to measure impact. It compares the reading levels of Pratham students with other children in the village, without taking into account the fact that Pratham targets those whose literacy levels are particularly poor at the beginning. If Pratham simply recruited the most literate children into their programs, and compared them to their poorer counterparts, they could claim success without conducting a single class. But Pratham does not do this. And realistically, Pratham does not expect its illiterate children to overtake the stronger students in the village. It simply tries to initiate improvement over the current state. Therefore the metric should be improvement in reading levels, not the final level. When we repeated EFA’s analysis using the more-appropriate outcome measure, the Pratham kids improved at twice the rate of the non-Pratham kids (0.6 reading level increase compared to 0.3). This difference is statistically very significant.

Had the EFA evaluators thought to look at the more appropriate outcome, they would recognize the incredible success of Read India. Perhaps they should enroll in some Pratham classes themselves.

Discussion Topic 3:

1. What type of evaluation is this letter using?

2. What represents the counterfactual?

3. What are the problems with this type of evaluation?

| Methodology | Description | Who is in the comparison group? | Required Assumptions | Required Data |
|---|---|---|---|---|
| Pre-Post | Measure how program participants improved (or changed) over time. | Program participants themselves, before participating in the program. | The program was the only factor influencing any changes in the measured outcome over time. | Before and after data for program participants. |
| Simple Difference | Measure difference between program participants and non-participants after the program is completed. | Individuals who didn’t participate in the program (for any reason), but for whom data were collected after the program. | Non-participants are identical to participants except for program participation, and were equally likely to enter the program before it started. | After data for program participants and non-participants. |
| Differences in Differences | Measure improvement (change) over time of program participants relative to the improvement (change) of non-participants. | Individuals who didn’t participate in the program (for any reason), but for whom data were collected both before and after the program. | If the program didn’t exist, the two groups would have had identical trajectories over this period. | Before and after data for both participants and non-participants. |
| Multivariate Regression | Individuals who received treatment are compared with those who did not, and other factors that might explain differences in the outcomes are “controlled” for. | Individuals who didn’t participate in the program (for any reason), but for whom data were collected both before and after the program. In this case the data comprise not just indicators of outcomes, but other “explanatory” variables as well. | The factors that were excluded (because they are unobservable and/or have not been measured) do not bias results because they are either uncorrelated with the outcome or do not differ between participants and non-participants. | Outcomes as well as “control variables” for both participants and non-participants. |
| Statistical Matching | Individuals in the control group are compared to similar individuals in the experimental group. | Exact matching: for each participant, at least one non-participant who is identical on selected characteristics. Propensity score matching: non-participants who have a mix of characteristics which predict that they would be as likely to participate as participants. | The factors that were excluded (because they are unobservable and/or have not been measured) do not bias results because they are either uncorrelated with the outcome or do not differ between participants and non-participants. | Outcomes as well as “variables for matching” for both participants and non-participants. |
| Regression Discontinuity Design | Individuals are ranked based on specific, measurable criteria. There is some cutoff that determines whether an individual is eligible to participate. Participants are then compared to non-participants and the eligibility criterion is controlled for. | Individuals who are close to the cutoff, but fall on the “wrong” side of that cutoff, and therefore do not get the program. | After controlling for the criteria (and other measures of choice), the remaining differences between individuals directly below and directly above the cut-off score are not statistically significant and will not bias the results. A necessary but not sufficient requirement for this to hold is that the cut-off criteria are strictly adhered to. | Outcomes as well as measures on criteria (and any other controls). |
| Instrumental Variables | Participation can be predicted by an incidental (almost random) factor, or “instrumental” variable, that is uncorrelated with the outcome, other than the fact that it predicts participation (and participation affects the outcome). | Individuals who, because of this close-to-random factor, are predicted not to participate and (possibly as a result) did not participate. | If it weren’t for the instrumental variable’s ability to predict participation, this “instrument” would otherwise have no effect on or be uncorrelated with the outcome. | Outcomes, the “instrument,” and other control variables. |
| Randomized Evaluation (experimental method) | Experimental method for measuring a causal relationship between two variables. | Participants randomly assigned to the control group. | Randomization “worked.” That is, the two groups are statistically identical (on observed and unobserved factors). | Outcome data for control and experimental groups. Control variables can help absorb variance and improve “power.” |
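The first three rows of the table can be sketched as simple calculations. The Python example below is a rough illustration only: the (baseline, endline) reading levels are hypothetical numbers invented for this sketch, not data from the Read India evaluation, and the function names are ours.

```python
from statistics import mean

# Hypothetical (baseline, endline) reading levels, coded 0 to 4.
# These values are invented for illustration and are not study data.
participants = [(0, 1), (1, 2), (0, 0), (1, 2)]
non_participants = [(2, 2), (3, 3), (2, 3), (3, 4)]


def pre_post(group):
    """Pre-Post: change in one group's mean outcome from baseline to endline."""
    return mean(end for _, end in group) - mean(base for base, _ in group)


def simple_difference(treated, comparison):
    """Simple Difference: endline mean of participants minus non-participants."""
    return mean(end for _, end in treated) - mean(end for _, end in comparison)


def difference_in_differences(treated, comparison):
    """Differences in Differences: participants' change minus non-participants' change."""
    return pre_post(treated) - pre_post(comparison)


print(pre_post(participants))                                     # 0.75
print(simple_difference(participants, non_participants))          # -1.75
print(difference_in_differences(participants, non_participants))  # 0.25
```

With these made-up numbers the three methods give three different answers, one of them negative, from the same children. Which answer, if any, is credible depends on whether the assumption in the corresponding row of the table holds.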

MIT OpenCourseWare http://ocw.mit.edu

Resource: Abdul Latif Jameel Poverty Action Lab Executive Training: Evaluating Social Programs Dr. Rachel Glennerster, Prof. Abhijit Banerjee, Prof. Esther Duflo

The following may not correspond to a particular course on MIT OpenCourseWare, but has been provided by the author as an individual learning resource.

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.