Document summary: This is a section from a multi-page
document entitled Large Scale Social Experimentation in Britain: What Can
and Cannot be Learnt from the Employment Retention and Advancement Demonstration?,
GCSRO, November 2003
The Employment Retention and Advancement (ERA) Demonstration project is a major new welfare-to-work social experiment, the largest random allocation evaluation ever mounted in Great Britain. This paper draws on experience gained in designing the ERA Demonstration to explore the strengths and limitations of social experimentation for policy evaluation and analysis, and to highlight some of the key issues that need to be considered in designing random allocation experiments.
The ERA Demonstration project will begin towards the end of 2003. The Demonstration will test a package of new services and financial incentives that aim to encourage groups on the margins of the labour market to obtain a job, retain work and advance in employment. Specifically, a new type of personal adviser service - the Advancement Support Adviser - will be tested alongside two new financial incentives: a retention and advancement bonus and a training bonus. The effectiveness of these new services and incentives will be compared to the effectiveness of existing services, notably the New Deal initiatives and financial incentives such as tax credits.
The new services and incentives developed through the ERA programme will be thoroughly tested in six areas of the country. There will be three target groups for the Demonstration: those eligible for the New Deal for Lone Parents (NDLP) and the New Deal for Long-term Unemployed (ND25+), and lone parents working part-time and claiming the Working Tax Credit (WTC). The centrepiece of the evaluation will be an impact study based on an experimental design. Individuals eligible for the ND25+ and NDLP will be randomly allocated to either continue with the New Deals (thereby serving as a control group) or to the ERA programme (thereby serving as a programme group). Similarly, lone parents working part-time and claiming the WTC will be allocated at random to either continue claiming the credit (thereby serving as a control group), or to receive ERA services and incentives in addition to the WTC (thereby serving as a programme group). Impacts will be measured as the difference in mean outcomes (e.g. earnings) between the programme and control groups.
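The impact measure described above can be sketched in a few lines. This is only an illustration: the function name estimate_impact, the use of an unpooled standard error and the earnings figures are assumptions made for the example, not part of the ERA evaluation design.

```python
import math
from statistics import mean, variance


def estimate_impact(programme, control):
    """Return (impact, standard_error) for a difference in mean outcomes.

    The standard error uses separate sample variances for each group
    (an assumption for this sketch, not a stated ERA method).
    """
    impact = mean(programme) - mean(control)
    se = math.sqrt(
        variance(programme) / len(programme) + variance(control) / len(control)
    )
    return impact, se


# Made-up weekly earnings, for illustration only (not ERA data).
# Zeros stand for people without earnings; they stay in the comparison.
programme_earnings = [120.0, 0.0, 210.0, 150.0, 95.0, 180.0]
control_earnings = [100.0, 0.0, 160.0, 140.0, 0.0, 170.0]

impact, se = estimate_impact(programme_earnings, control_earnings)
print(f"estimated impact: {impact:.2f} (standard error {se:.2f})")
```

Note that everyone who was randomly allocated is included in the comparison, whether or not they have earnings. It is this that preserves the equivalence of the two groups.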
Random allocation is adopted to estimate the impact of the ERA programme because it provides unbiased or ‘internally valid’ estimates of the programme’s impact. It does so because random allocation ensures that the only differences between programme and control groups at the point of randomisation are random differences - in other words, it ensures that there are no systematic differences between the two groups, and consequently they are statistically equivalent. The counterfactual (that is, the mean value of outcomes that would have prevailed for the programme group had they not received new services and incentives) can therefore be estimated from the control group. In the absence of the programme, the only differences between the mean values of outcomes for individuals in the control and programme groups are those that occur at random. As a result, counterfactual estimates of programme outcomes, derived through a control group constructed at random, are considered ‘unbiased’.
A range of alternative quasi-experimental approaches to measuring the effectiveness of the ERA services and incentives can potentially be used instead of random allocation. For example, estimates of programme impacts can be derived from simple ‘before and after’ type estimators. Alternatively, counterfactual samples can be selected from carefully matched comparison or control areas. Those eligible for a programme who fail to join it can also be sampled and used to construct counterfactual estimates. Some form of matching, such as that based on propensity scores, can be used to improve quasi-experimental estimates of programme impacts. Despite these refinements, all of the quasi-experimental alternatives to random allocation possess substantial drawbacks. The crux of the problem centres on the inability of quasi-experimental methods to deal convincingly with the problem of unobserved selection bias.
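The problem of unobserved selection bias can be illustrated with a small simulation. Everything in it is an assumption made for the example: the outcome model, the idea that more 'motivated' people choose to join, the 50/50 allocation and the function name compare_designs. It is a sketch of the general argument, not a model of ERA.

```python
import random
from statistics import mean


def compare_designs(n=20000, true_impact=5.0, seed=0):
    """Compare a participant/non-participant estimate with a randomised one.

    'motivation' is unobserved by the evaluator and raises outcomes
    whether or not a person receives the programme.
    """
    rng = random.Random(seed)
    motivations = [rng.gauss(0.0, 1.0) for _ in range(n)]

    def outcome(motivation, treated):
        effect = true_impact if treated else 0.0
        return 50.0 + 10.0 * motivation + effect + rng.gauss(0.0, 5.0)

    # Design 1: people choose for themselves; the more motivated join.
    joined = [outcome(m, True) for m in motivations if m > 0]
    stayed_out = [outcome(m, False) for m in motivations if m <= 0]
    self_selected_estimate = mean(joined) - mean(stayed_out)

    # Design 2: allocation by chance alone, unrelated to motivation.
    programme, control = [], []
    for m in motivations:
        if rng.random() < 0.5:
            programme.append(outcome(m, True))
        else:
            control.append(outcome(m, False))
    randomised_estimate = mean(programme) - mean(control)

    return self_selected_estimate, randomised_estimate


self_selected, randomised = compare_designs()
print(f"true impact: 5.0, self-selected: {self_selected:.1f}, "
      f"randomised: {randomised:.1f}")
```

Under these assumptions the participant/non-participant comparison attributes the effect of motivation to the programme and so overstates the impact, while the randomised comparison stays close to the true value. Matching on observed characteristics would not help here, because motivation is not observed.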
Notwithstanding the benefits of an experimental design, significant barriers exist to the proper implementation of random allocation. Moreover, there are clearly instances where random allocation is unsuitable on ethical grounds.
The twin problems of ‘crossovers’ and ‘contamination’ provide appreciable challenges to the designers of social experiments. Crossovers occur when individuals are no longer allocated to programme and control groups by chance alone, and some systematic component enters into the process of allocation. Contamination occurs when individuals assigned to the control group inadvertently receive services or treatments intended for programme group members.
The ERA experimental design seeks to limit the potential for crossovers and contamination by ensuring that, where possible, programme services and incentives are delivered by a separate group of staff. Furthermore, technical advisers will be on hand to ensure that frontline staff observe random allocation protocols and that administrative records are kept so that it is clear whether a given individual is a member of the programme or control group. A centrally-administered random allocation algorithm ensures that both administrators and customers are unable to ‘game’ the allocation process.
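The general idea of a centrally-administered allocation can be sketched as follows. This is not the ERA algorithm, whose details are not given here. The class name AllocationRegister, the 50/50 allocation ratio and the customer identifiers are all hypothetical.

```python
import random


class AllocationRegister:
    """Hypothetical central register that allocates each customer once.

    Local staff supply only an identifier. They cannot choose the group,
    and asking again returns the recorded group rather than a fresh draw.
    """

    def __init__(self, seed=None):
        self._rng = random.Random(seed)
        self._assignments = {}

    def allocate(self, customer_id):
        # A repeat request must never trigger a second random draw,
        # otherwise staff or customers could retry until they get
        # the group they prefer.
        if customer_id in self._assignments:
            return self._assignments[customer_id]
        # Assumed 50/50 allocation ratio.
        group = "programme" if self._rng.random() < 0.5 else "control"
        self._assignments[customer_id] = group
        return group

    def group_of(self, customer_id):
        """Look up the recorded group, or None if not yet allocated."""
        return self._assignments.get(customer_id)


register = AllocationRegister(seed=2003)
first = register.allocate("customer-001")
second = register.allocate("customer-001")  # same answer, no new draw
print(first, second, register.group_of("customer-002"))
```

Keeping the record of who belongs to which group in one place also supports the administrative checks described above, since any service received can be compared against the recorded allocation.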
One of the key issues in ensuring that social experiments produce useful findings is to consider carefully the selection of localities (experimental sites) where the experiment will be implemented. Ideally, experimental sites would be selected at random, but this is seldom feasible and was not an option for the ERA Demonstration. Instead, six experimental sites were selected from across Great Britain on the basis of a number of criteria. These included the need to avoid Jobcentre Plus¹ districts engaged in major administrative reorganisation, the need to obtain a reasonable geographical spread of sites and the need to be able to select samples of sufficient size in each site, such that programme impacts might be detected at site level.
In common with many quasi-experimental and experimental evaluations in Great Britain, the ERA Demonstration is largely reliant on survey data to measure outcomes. The larger the survey samples, the smaller the impacts that can be detected. The problem is that it is expensive to collect data from survey samples. The ERA Demonstration project used the concept of Minimum Detectable Impact to identify the most appropriate trade-off between cost and sample size.
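The trade-off between sample size and detectable impact can be made concrete with the conventional textbook formula for a Minimum Detectable Impact on a difference in means. The formula, the 5 per cent two-sided significance level, the 80 per cent power and the illustrative standard deviation are assumptions for this sketch. The calculations actually used for ERA are not reproduced here.

```python
import math
from statistics import NormalDist


def minimum_detectable_impact(sd, n_total, share_programme=0.5,
                              alpha=0.05, power=0.80):
    """Smallest true impact detectable with the given power.

    sd               -- standard deviation of the outcome (assumed known)
    n_total          -- total sample across programme and control groups
    share_programme  -- fraction of the sample in the programme group
    """
    z = NormalDist()
    multiplier = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    standard_error = sd * math.sqrt(
        1.0 / (share_programme * (1 - share_programme) * n_total)
    )
    return multiplier * standard_error


# Illustrative outcome standard deviation of 100 (not an ERA figure).
for n in (500, 1000, 2000, 4000):
    print(n, round(minimum_detectable_impact(sd=100.0, n_total=n), 1))
```

Because the standard error falls with the square root of the sample size, quadrupling the sample only halves the Minimum Detectable Impact, which is why each further gain in precision costs more in survey interviews than the last.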
Random allocation designs rely on computing the difference between average values for programme group outcomes (e.g. earnings) and average values for control group outcomes, in order to estimate the impact of the programme or intervention under investigation. In many social experiments, however, simple experimental comparisons are not sufficient to address the full range of questions that evaluators wish to consider. In the case of the ERA Demonstration, one of the key issues is whether the programme leads to improvements in hourly wages and increased wage progression among those in the programme group. Because only a fraction of the programme and control groups will enter employment and thereby record hourly wages, and it is anticipated that the process of obtaining work by members of the programme group will be influenced by ERA services and incentives, it is highly likely that a simple comparison of wage rates between employed programme and control group members will not yield unbiased estimates of programme effectiveness. For this reason, quasi-experimental methods will be required in addition to simple comparisons of programme and control group outcomes.
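A small simulation shows why comparing wages only among those in work can mislead, even when allocation was random. The wage model, the employment thresholds and the function name simulate_employed_wage_gap are assumptions chosen for the example. In particular, the programme is assumed to have no effect at all on anyone's wage and only to help more people into work.

```python
import random
from statistics import mean


def simulate_employed_wage_gap(n=20000, seed=0):
    """Difference in mean hourly wage between employed programme and
    employed control group members, when the true wage effect is zero."""
    rng = random.Random(seed)

    def employed_wages(skill_threshold):
        wages = []
        for _ in range(n):
            skill = rng.gauss(0.0, 1.0)  # unobserved by the evaluator
            if skill > skill_threshold:  # only the employed report a wage
                wages.append(10.0 + 2.0 * skill)
        return wages

    # Control group: only those with above-average skill find work.
    control = employed_wages(0.0)
    # Programme group: extra support also brings lower-skill people into work.
    programme = employed_wages(-0.5)
    return mean(programme) - mean(control)


gap = simulate_employed_wage_gap()
print(f"wage gap among the employed: {gap:.2f}")
```

Under these assumptions the gap is negative: the programme appears to lower wages, although it changes nobody's wage, because the employed members of the two groups are no longer comparable. Randomisation balances the groups as allocated, not the subsets that later enter work.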
Social experiments seek to answer questions about causality and the impact of programmes or interventions. There is a range of questions of interest to policymakers and evaluators, however, which social experiments either cannot address at all (because they are not designed to do so) or can only address with specific modifications to the experimental design. But such modifications often render the practical implementation of experiments problematic.
One of the main charges levelled against experiments is that they fail to provide an explanatory account of the processes that give rise to observed programme impacts. This limitation is frequently termed the ‘black box problem’. For example, the ERA Demonstration involves the delivery of both caseworker services and financial incentives as a single package. The experimental design - the allocation of participants to a single programme group or to a control group - does not allow separate experimental estimates of the impact of Advancement Support Adviser (ASA) services and the impact of financial incentives. In order to address the issue of the relative effectiveness of different elements of the ERA programme, more complex, differential, randomised designs are required. These designs both require larger sample sizes to make multiple comparisons and place a greater administrative burden on frontline staff, consequently increasing the likelihood of administrative error. For these reasons, a differential design for the ERA programme was rejected, despite the analytical gains that can result from such designs. As a result, the evaluation of ERA relies heavily on a non-experimental, observational process study to uncover evidence of the separate contributions that different components of the programme make to programme impacts, should these impacts actually materialise.
A critical issue in evaluation is that of ‘external validity’ - the extent to which estimated programme impacts can be generalised to different locations and populations, to different time periods and to different variants of the programme being studied. Generalisability is an issue for all forms of evaluation, including social experimentation. Results from an experiment might not hold at different time points and in different geographical localities. Experimental impact estimates are usually derived from pilots or demonstrations that are limited to a particular set of areas and are thus smaller in scale than a national programme. As a result, it may be problematic to infer the impact of a national full-scale programme from a smaller-scale experiment. Furthermore, substitution effects, Hawthorne effects, entry effects and general equilibrium effects may all limit the capacity to draw generalisable estimates of programme impacts from a single experiment.
The ERA Demonstration illustrates both the strengths and weaknesses of social experiments in evaluating social programmes. For evaluating ERA, and a wide variety of other social policy interventions, an experimental design is superior to alternative designs that might be used instead - for example, a ‘before and after’ comparison, matched sites, or a participant/non-participant comparison. It will provide greater assurance of internal validity, while being no more costly or time-consuming. However, this does not mean that experimental designs are always superior for evaluating all social policies; just that experiments are often advantageous, and that random allocation is clearly the best approach for evaluating ERA. Quasi-experimental methods may be less expensive and less time-consuming than random assignment for evaluating existing programmes. Moreover, occasionally there are ethical reasons for not using random allocation. Nonetheless, if implemented and run properly, an experimental design will almost always provide greater internal validity than alternative approaches.
No single evaluation design can answer all the questions about a specific social policy that are of interest, and random allocation is no exception. Sometimes, however, certain design modifications can be made that can help address certain issues. For example, although ultimately not adopted, consideration was given to using a differential experimental design for the ERA Demonstration in order to determine whether the impact of combining financial incentives with services is greater than the impact of financial incentives alone. Other limitations of a single evaluation design can be at least partially overcome by combining several different approaches. For example, quasi-experimental econometric methods will be required to examine certain issues concerning ERA’s impact on advancement, while a process study will be used to help determine the context and the manner in which ERA services were delivered.
There are certain important questions that no combination of evaluation methods can definitively address, however. For example, neither experimental nor non-experimental methods will be able to provide more than limited information about which specific components of ERA are most or least effective - the so-called ‘black box problem’. In addition, once findings from the ERA Demonstration become available, uncertainty will inevitably remain about their ‘external validity’ - that is, the extent to which they can be generalised to different locations and populations and to different time periods; whether they are subject to scale bias, general equilibrium wage effects, substitution effects and/or Hawthorne effects; and whether entry effects might occur if ERA is rolled out nationally that did not arise during the Demonstration - regardless of the combination of experimental and non-experimental methods that were used to obtain them.
Footnotes:
1. The ERA Demonstration project is to be delivered through Jobcentre Plus.
Crown copyright © 2003; Published November 2003.