Evaluation in the Practice of Development

Page 1: Evaluation in the Practice of Development

Evaluation in the Practice of Development

Martin RavallionDevelopment Research Group, World Bank

Further reading: “Evaluation in the Practice of Development,” World Bank Research Observer, Spring 2009.“Should the Randomistas Rule?” Economists Voice Vol. 6, No. 2. 2009

Lecture at the “Perspectives on Impact Evaluation” Conference, Cairo, 2009

Page 2: Evaluation in the Practice of Development

“Seeking truth from facts.”

• In 1978, the Chinese Communist Party’s 11th Congress broke with its ideology-based view of policy making in favor of a pragmatic approach, which Deng Xiaoping famously dubbed “feeling our way across the river.”

• At its core was the idea that public action should be based on evaluations of experiences with different policies—“the intellectual approach of seeking truth from facts.”

• In looking for facts, a high weight was put on demonstrable success in actual policy experiments on the ground.

• These were not Randomized Controlled Trials (RCTs) and would not pass muster by modern standards. But the basic idea was right.

The rural reforms that were then implemented nationally helped achieve probably the most dramatic reduction in the extent of poverty the world has yet seen.

Page 3: Evaluation in the Practice of Development


1. We under-invest in evaluation

There are significant gaps between what we know and what we want to know about development effectiveness.

These gaps stem from distortions in the market for knowledge.

2. Current evaluations fall short

Standard approaches are not geared to addressing these distortions and consequent knowledge gaps.

We can do better! 10 steps for more relevant impact evaluations.

Page 4: Evaluation in the Practice of Development

Why are there persistent gaps in our knowledge about development effectiveness?

Page 5: Evaluation in the Practice of Development

Knowledge market failures

Suppliers and demanders of knowledge meet, but:

• Asymmetric information about the quality of the evaluation.
– Development practitioners cannot easily assess the quality and expected benefits of an impact evaluation, to weigh against the costs.
– Short-cut, non-rigorous methods promise quick results at low cost, though users are rarely well informed of the inferential dangers.

• Externalities: benefits spill over to future development projects.
– But current individual projects hold the purse strings.
– The project manager will typically not take account of the external benefits when deciding how much to spend on evaluative research.
– Externalities are larger for some types of evaluation (the first of its kind; “clones” expected; more innovative designs).

Page 6: Evaluation in the Practice of Development


Current research priorities are not in line with the needs of practitioners

• Classic concern is with internal validity for mean treatment effect on the treated for an assigned program with no spillover effects.

• And internal validity is mainly judged by how well one has dealt with selection bias due to unobservables.

This approach has severely constrained the relevance of impact evaluation to development policy making.

Page 7: Evaluation in the Practice of Development

Bigger problem for some evaluations

• Far easier to evaluate an intervention that yields its likely impact within one year (say) than one that takes many years.
• No surprise that credible evaluations of the longer-term impacts of (for example) infrastructure projects are rare.
• We know very little about the long-term impacts of development projects that do deliver short-term gains.
• So future practitioners are often poorly informed about what works and what does not.

Biases in our knowledge favor projects with well-defined beneficiaries that yield quick results.

Page 8: Evaluation in the Practice of Development

Are social experiments the solution?

• No obvious reason why doing more social experiments would help correct the distortions that generated those knowledge gaps.
– Randomization is only feasible for a non-random sub-set of policies and settings.
• It is rarely feasible to randomize the location of infrastructure projects and related programs, which are core activities in almost any poor country’s development strategy.
– And when randomization is a relevant tool, it will be adopted more readily in some settings than others.

• Ethical and political concerns loom large—stemming from the fact that some of those to whom a program is randomly assigned will almost certainly not need it, while some in the control group will.

Better idea: randomize what gets evaluated rigorously, and then choose a method appropriate to each sampled intervention, with randomization as one option.

Page 9: Evaluation in the Practice of Development

Dissemination/publication is a crucial but weak link in the learning process

• Externalities again: high current costs of completing the cycle of knowledge generation, with benefits accruing to others.

• Prior beliefs (based on past publications) seem to get too high a weight in the review process.
– Reviewers (often past contributors) tend to over-state the precision of past knowledge (priors are overweighted in the Bayesian sense).

• Academic objectives dominate: emphasis on internal validity even if the question addressed is of little relevance to filling pressing knowledge gaps about development effectiveness.

Page 10: Evaluation in the Practice of Development

Researcher incentives

The incentives of researchers are not well aligned with prevailing knowledge gaps:
– Choice of topic and what to report; multiple outcomes, with selective reporting.
– If you have enough outcome variables you will eventually have something to report, even if there is no real impact.
– Replication is treated as a second-class citizen.
– Weak incentives to make data public, and weak coverage and enforcement of journal policies.
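The “enough outcome variables” point can be made concrete with a small simulation (illustrative only, not from the lecture): even when the true impact on every outcome is exactly zero, a study that tests 20 outcomes at the nominal 5% level will report at least one “significant” impact roughly two-thirds of the time.

```python
# Illustrative simulation: selective reporting over many outcomes.
# With 20 independent outcomes and NO true effect, the chance of at
# least one |t| > 1.96 is about 1 - 0.95**20 ~ 0.64.
import math
import random
import statistics

random.seed(0)

def significant_outcome(n=100):
    """One treatment-vs-control comparison on an outcome with zero true effect."""
    treat = [random.gauss(0, 1) for _ in range(n)]
    ctrl = [random.gauss(0, 1) for _ in range(n)]
    se = math.sqrt(statistics.variance(treat) / n + statistics.variance(ctrl) / n)
    t = (statistics.mean(treat) - statistics.mean(ctrl)) / se
    return abs(t) > 1.96  # nominal 5% two-sided test

trials = 200
outcomes_per_study = 20
# Fraction of "studies" that find something to report despite zero impact
frac = sum(
    any(significant_outcome() for _ in range(outcomes_per_study))
    for _ in range(trials)
) / trials
print(round(frac, 2))  # close to 0.64
```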

Page 11: Evaluation in the Practice of Development

Rising donor interest

• There is now a broader awareness of the problems faced when trying to do evaluations, including the age-old problem of identifying “causal” impacts.
• This has helped make donors less willing to fund weak proposals for evaluations that are unlikely to yield reliable knowledge about development effectiveness.

However:
• The resources do not always go to rigorous evaluations.
• Nor are the extra resources having as much impact as they could on the incentives facing project managers, governments and researchers.
• Donor support needs to focus on increasing the marginal private benefits from evaluation, or reducing the marginal costs.

Page 12: Evaluation in the Practice of Development


Ten steps to more policy-relevant impact evaluations

Can we do better?

Page 13: Evaluation in the Practice of Development

1. Start with a policy-relevant question and be eclectic on methods

• Policy-relevant evaluation must start with interesting and important questions.

• But instead many evaluators start with a preferred method and look for questions that can be addressed with that method.

• By constraining evaluative research to situations in which one favorite method is feasible, research may exclude many of the most important and pressing development questions.

Page 14: Evaluation in the Practice of Development

Standard methods don’t address all the policy-relevant questions

• What is the relevant counterfactual?
– “Do nothing”: that is rare; but how do we identify a more relevant counterfactual?
– Example from workfare programs in India (a do-nothing counterfactual vs. an alternative policy).

• What are the relevant parameters to estimate?
– Mean vs. poverty (marginal distribution).
– Average vs. marginal impact.
– Joint distribution of Y_T and Y_C, especially if some participants are worse off: the ATE only gives the net gain for participants.
– Policy effects vs. structural parameters.

• What are the lessons for scaling up?
• Why did the program have (or not have) impact?
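The point about the joint distribution of Y_T and Y_C can be sketched with made-up numbers: a program whose gains average +0.5 but vary widely across participants has a clearly positive ATE, yet leaves a large minority worse off—something the ATE alone cannot reveal.

```python
# Hypothetical illustration (assumed numbers): a positive ATE can hide
# the fact that many participants lose from the program.
import random
import statistics

random.seed(1)

n = 10_000
# Per-participant gain Y_T - Y_C: mean 0.5, but standard deviation 2,
# so the joint distribution of potential outcomes matters, not just the mean.
gains = [random.gauss(0.5, 2.0) for _ in range(n)]

ate = statistics.mean(gains)
share_worse_off = sum(g < 0 for g in gains) / n

print(round(ate, 2))              # around 0.5: the program "works" on average
print(round(share_worse_off, 2))  # around 0.4: yet ~40% of participants lose
```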

Page 15: Evaluation in the Practice of Development

2. Take the ethical objections and political sensitivities seriously; policy makers do!

• Pilots (using NGOs) can often get away with methods not acceptable to governments accountable to voters.
• Deliberately denying a program to those who need it, and providing the program to some who do not.
– Yes, there are too few resources to go around. But is randomization the fairest solution to limited resources?
– What does one condition on in conditional randomizations?
• Intention-to-treat helps alleviate these concerns => randomize assignment, but leave people free to not participate.
• But even then, the “randomized out” group may include people in great need.

The information available to the evaluator (for conditioning impacts) is a partial subset of the information available “on the ground” (including to voters).
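A minimal sketch of the intention-to-treat idea, using assumed numbers (a true effect of 1.0 on actual participants and 60% take-up among those assigned): because ITT compares groups by random *assignment* rather than actual participation, the estimate is diluted toward take-up times the effect on participants.

```python
# Sketch of intention-to-treat (ITT) under assumed parameters: assignment
# is randomized, participation is voluntary, and the ITT estimate is the
# take-up rate times the effect on those who actually participate.
import random
import statistics

random.seed(2)

n = 20_000
true_effect = 1.0  # assumed effect on those who actually participate
take_up = 0.6      # assumed: 60% of those assigned choose to participate

assigned, control = [], []
for _ in range(n):
    baseline = random.gauss(0, 1)
    participates = random.random() < take_up  # free to not participate
    assigned.append(baseline + true_effect * participates)
    control.append(random.gauss(0, 1))

# ITT: difference in means by ASSIGNMENT, not by participation
itt = statistics.mean(assigned) - statistics.mean(control)
print(round(itt, 2))  # near take_up * true_effect = 0.6, not 1.0
```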

Page 16: Evaluation in the Practice of Development

3. Take a comprehensive approach to the sources of bias

• Two sources of selection bias: observables and unobservables (to the evaluator), i.e., participants have latent attributes that yield higher/lower outcomes.
• Some economists have become obsessed with the latter bias, while ignoring innumerable other biases and problems:
– Less-than-ideal methods of controlling for observable heterogeneity, including ad hoc (linear, parametric) models of outcomes.
– Evidence that we have given too little attention to the problem of selection bias based on observables.
– Arbitrary preferences for one conditional independence assumption (exclusion restrictions) over another (conditional exogeneity of placement).

We cannot scientifically judge appropriate assumptions/methods independently of program, setting and data.
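Selection bias on *observables* can be just as damaging as bias on unobservables, as this illustrative sketch (with an assumed data-generating process) shows: participation depends on an observable x, the outcome depends on x, and the true program effect is zero; a crude comparison of participants and non-participants is badly biased, while flexible conditioning on x (here, exact stratification) removes the bias.

```python
# Illustrative simulation of selection bias on an OBSERVABLE characteristic.
# True program effect is ZERO; naive comparison is biased, stratification is not.
import random
import statistics

random.seed(3)

n = 50_000
rows = []  # (x stratum, participates, outcome)
for _ in range(n):
    x = random.randint(0, 4)                          # observable, 5 strata
    participates = random.random() < 0.1 + 0.15 * x   # higher-x people sign up more
    y = x ** 2 + random.gauss(0, 1)                   # outcome depends on x only
    rows.append((x, participates, y))

def mean_y(group):
    return statistics.mean(y for _, _, y in group)

# Naive: participants vs. non-participants, ignoring x
naive = mean_y([r for r in rows if r[1]]) - mean_y([r for r in rows if not r[1]])

# Stratified: compare within each x stratum, then average across strata
within = []
for s in range(5):
    t = [r for r in rows if r[0] == s and r[1]]
    c = [r for r in rows if r[0] == s and not r[1]]
    within.append(mean_y(t) - mean_y(c))
stratified = statistics.mean(within)

print(round(naive, 2))       # far from zero: pure selection on observables
print(round(stratified, 2))  # close to the true effect of zero
```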

Page 17: Evaluation in the Practice of Development

4. Do a better job on spillover effects

• Are there hidden impacts for non-participants?
• Spillover effects can stem from:
– Markets.
– Behavior of participants/non-participants.
– Behavior of intervening agents (governmental/NGO).

Example 1: Employment Guarantee Scheme
• An assigned program, but no valid comparison group if the program works the way it is intended to work.

Example 2: Southwest China Poverty Reduction Program
• Displacement of local government spending in treatment villages => benefits go to the control villages.
• Substantial underestimation of impact.
• A model implies that the true DD = 1.5 x the empirical DD.
• Key conclusions on long-run impact are robust in this case.
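A stylized version of this mechanism, with made-up numbers rather than the actual SWPRP data: suppose the program raises treated-village outcomes by 3, but displaced local spending raises *control*-village outcomes by 1. The empirical difference-in-differences then recovers only 2, so the true impact is 1.5 times the empirical DD, matching the factor quoted above.

```python
# Stylized difference-in-differences with spillover to controls
# (assumed numbers: impact 3 on treated villages, spillover 1 to controls).
import random
import statistics

random.seed(4)

n = 5_000                        # villages per arm (illustrative)
true_impact, spillover = 3.0, 1.0

def village_change(effect):
    """Before-after change for one village: common trend + effect + noise."""
    before = random.gauss(10, 1)
    after = before + 0.5 + effect + random.gauss(0, 1)  # 0.5 = common trend
    return after - before

treated_change = statistics.mean(village_change(true_impact) for _ in range(n))
control_change = statistics.mean(village_change(spillover) for _ in range(n))

# Spillover inflates the control trend, so DD understates the impact
empirical_dd = treated_change - control_change
print(round(empirical_dd, 2))                # near 3 - 1 = 2
print(round(true_impact / empirical_dd, 2))  # near 1.5: true DD = 1.5 x empirical
```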

Page 18: Evaluation in the Practice of Development

5. Take a sectoral approach, recognizing fungibility/flypaper effects

• Fungibility
– You are not in fact evaluating what the extra public resources (including aid) actually financed.
– So your evaluation may be deceptive about the true impact of those resources.

• Flypaper effects
– Impacts may well be found largely within the “sector”.
– Example from a Vietnam roads project: fungibility within the transport sector, but a flypaper effect on the sector.
– Need for a broad sectoral approach.

Page 19: Evaluation in the Practice of Development

6. Fully explore impact heterogeneity

• Impacts will vary with participant characteristics (including those not observed by the evaluator) and context.

• Participant heterogeneity
– Interaction effects.
– Also “essential heterogeneity”, with participant responses (Heckman-Urzua-Vytlacil).
– Implications for:
• evaluation methods (local instrumental variables estimator);
• project design, and even whether the project can have any impact (example from China’s SWPRP);
• external validity (generalizability) =>

Page 20: Evaluation in the Practice of Development

Impact heterogeneity (cont.)

• Contextual heterogeneity
– “In certain settings anything works, in others everything fails.”
– Local institutional factors in development impact.

• Example of Bangladesh’s Food-for-Education program
– The same program works well in one village, but fails hopelessly nearby.

Page 21: Evaluation in the Practice of Development

7. Take “scaling up” seriously

With scaling up:
• Inputs change:
– Entry effects: the nature and composition of those who “sign up” changes with scale.
– Migration responses.
• The intervention changes:
– Resource effects on the intervention.
• Outcomes change:
– Lags in outcome responses.
– Market responses (partial equilibrium assumptions are fine for a pilot but not when scaled up).
– Social effects/political economy effects; early vs. late capture.

But there is little work on external validity and scaling up.

Page 22: Evaluation in the Practice of Development

Examples of external invalidity: scaling up from randomized pilots

• The people normally attracted to a program do not have the same characteristics as those randomly assigned, and impacts vary with those characteristics
=> “randomization bias” (Heckman and Smith).

• We have evaluated a different program to the one that actually gets implemented nationally!
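A hypothetical sketch of this randomization bias, with an assumed impact function: impacts rise with a characteristic x, the pilot enrolls a random cross-section, but the scaled-up program attracts those with the most to gain, so the pilot understates the impact on the people the national program actually reaches.

```python
# Hypothetical "randomization bias" at scale-up: heterogeneous impacts plus
# self-selection into the national program (assumed impact function 1 + 2x).
import random
import statistics

random.seed(5)

def impact(x):
    return 1.0 + 2.0 * x  # assumed: impact increases with characteristic x

n = 100_000
population = [random.random() for _ in range(n)]  # x ~ Uniform(0, 1)

# Pilot: participants drawn at random from the whole population
pilot_ate = statistics.mean(impact(x) for x in population)

# At scale: only people with x above the median find it worth signing up
joiners = [x for x in population if x > 0.5]
scaled_att = statistics.mean(impact(x) for x in joiners)

print(round(pilot_ate, 2))   # near 2.0 from the randomized pilot
print(round(scaled_att, 2))  # near 2.5 for those the national program attracts
```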

Page 23: Evaluation in the Practice of Development

8. Understand what determines impact

• Replication across differing contexts
– Example of Bangladesh’s FFE:
• inequality etc. within the village => outcomes of the program;
• implications for sample design => a trade-off between the precision of overall impact estimates and the ability to explain impact heterogeneity.

• Intermediate indicators
– Example of China’s SWPRP:
• small impact on consumption poverty;
• but a large share of the gains were saved.

• Qualitative research/mixed methods
– Test the assumptions (“theory-based evaluation”).
– But a poor substitute for assessing impacts on final outcomes.

In understanding impact, Step 9 is key =>

Page 24: Evaluation in the Practice of Development


9. Don’t reject theory and structural models

• Standard evaluations are “black boxes”: they give policy effects in specific settings but not structural parameters (as relevant to other settings).

• Structural methods allow us to simulate changes in program design or setting.

• However, assumptions are needed. (The same is true for black box social experiments.) That is the role of theory.

• PROGRESA example (Attanasio et al.; Todd & Wolpin)

• Modeling schooling choices using randomized assignment for identification

• Budget-neutral switch from primary to secondary subsidy would increase impact

Page 25: Evaluation in the Practice of Development

10. Develop capabilities for evaluation within developing countries

• Strive for a culture of evidence-based evaluation practice.
– China example: “seeking truth from facts” + the role of research.

• Evaluation is a natural addition to the roles of the government’s sample survey unit. – Independence/integrity should already be in place.

– Connectivity to other public agencies may be a bigger problem.

• Sometimes a private evaluation capability will be required.

Page 26: Evaluation in the Practice of Development

There are significant gaps between what we know and what we want to know about development effectiveness.

These gaps stem from distortions in the market for knowledge.

Standard approaches to impact evaluation are not geared to addressing these distortions and the consequent knowledge gaps.

We can do better!

