Trying It Out - The Role of 'Pilots' in Policy-Making: Report of a Review of Government Pilots
4.2 Experimental methods: randomised controlled trials (RCTs)
4.3 Ethical considerations and RCTs
4.4 Quasi-experimental methods
Even more tricky than decisions about whether to conduct pilots are decisions about how to conduct them. Methods of evaluating a new policy may be summative, formative or both. Summative methods are used to determine whether and to what extent a policy is having its desired effect or impact on its intended target groups. Formative methods are used to shape a policy and/or determine why, how or under what conditions it may be best directed or implemented. Both sorts of evaluations use a range of research methods but typically summative evaluations employ quantitative and/or experimental methods, while formative evaluations rely more on qualitative and/or ethnographic methods. But these distinctions are by no means rigid.
These two broad approaches are complementary rather than competitive, in much the same way as there is a need for both quantitative and qualitative methodology in piloting as in all other forms of evaluation. What matters is rigour and fitness for purpose, not an a priori methodological preference.
Widely acknowledged as the most robust and rigorous of these approaches, though sometimes ruled out on practical or political grounds, is the randomised controlled trial (RCT), best known for its pivotal role in medical research. In its purest and simplest form, a random sample of people (or units such as schools, housing estates or hospitals) is selected for the experimental design. A random half of them are allocated to the treatment or experimental group, while the other half are allocated to the control or comparison group to measure the counterfactual. As long as the samples in each group are sufficiently large, differences in the outcomes of the two groups can reasonably be attributed to the treatment. The principle behind a randomised controlled trial is that other exogenous or confounding factors that might otherwise influence outcomes ought to be randomly distributed between the treatment and the control group.
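The allocation step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a description of how any pilot in this report was run: the function names, the fixed seed, the simulated outcomes and the built-in effect of 2.0 are all hypothetical.

```python
import random
from statistics import mean

def randomise(units, seed=0):
    """Shuffle the units and split them into two halves: (treatment, control)."""
    rng = random.Random(seed)  # fixed seed so the allocation can be reproduced
    shuffled = list(units)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

def estimated_impact(outcomes, treatment, control):
    """Difference between the mean outcome of the treatment and control groups."""
    return mean(outcomes[u] for u in treatment) - mean(outcomes[u] for u in control)

# Hypothetical sample of 1,000 people, identified by number.
people = range(1000)
treatment, control = randomise(people)

# Simulated outcomes (an assumption for illustration only): a noisy baseline
# plus a built-in effect of 2.0 for those who were treated.
rng = random.Random(1)
treated = set(treatment)
outcomes = {p: rng.gauss(10.0, 3.0) + (2.0 if p in treated else 0.0) for p in people}

print(round(estimated_impact(outcomes, treatment, control), 2))
```

Because the split is random, the baseline noise is spread evenly across the two groups on average, so with samples of this size the printed estimate should lie close to the built-in effect.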
As noted, RCTs of individuals are a major form of policy-testing for social interventions in the US and Canada (Greenberg and Shroder, 1997; Boruch, 1997). In Britain, however, while they are still routinely employed in medical trials, they are much more sparingly applied in social policy interventions. Even so, a number of major British pilots have used RCTs, such as the Restart programme (White and Lakey, 1992), New Deal for 25+ (Wilkinson, forthcoming), Employment Zones, Intensive Gateway Trailblazers (Davies and Irving, 2000), and more recently the ERA project (Morris et al., 2003) (Case study 1, p.9). But most British social policy pilots tend to be conducted not only by means of area-based trials in preference to individual-based ones, but also by means of matched comparisons rather than random assignment.
It is fair to report that most of these pilots were bedevilled by practical problems of implementation which greatly reduced their power. It was not their design that let them down. Rather, partly because random assignment is so rarely employed in social policy trials here, the staff who were entrusted with implementing the procedures were often ill-prepared and inadequately trained.
Some of the departmental civil servants we interviewed believed that the difficulties of RCTs were exaggerated, but others (as well as two Ministers) continue to regard random assignment with deep suspicion, and only partly for practical reasons (Hogwood, 2001). They point to disadvantages such as the time that RCTs take to set up properly and the careful management they require. They also, rightly, refer to the fact that certain interventions, such as curriculum changes, are almost impossible to allocate randomly between individuals in the same schools. But their principal objections tend to be political or ethical in nature. It is unethical (or at any rate inequitable), so the argument runs, for a government to allocate an obvious benefit to a certain set of individuals selected at random and give neither their neighbours nor indeed another randomly selected (control) group access to the same benefit. Even though it is only an experiment, they believe it might cause justified resentment among those excluded from the treatment group, particularly perhaps those within the control group itself.
The fact that this procedure is almost universal in medical trials of new drugs (where the potential to save lives is sometimes at stake, not merely the differential receipt of a social benefit) does not, however, placate their opposition. Nor apparently does the Benthamite justification that any inequity at the individual level can be justified by the large potential gain the knowledge might bring at the mass level.
The opposite ethical worry sometimes expressed about pilots is that the treatment group might be disadvantaged in some way (whether in the short- or long-term) by the treatment they alone are, as yet, receiving. But such worries apply to all experiments or early trials and not specifically to RCTs.
RCTs for social policy trials do, however, differ in one important respect from RCTs for clinical trials. The experimental recipient of, say, a new drug treatment in a medical trial is not necessarily a beneficiary, since these trials are always conducted under conditions of clinical equipoise, that is, an absence of evidence as to whether the treatment will be effective. Indeed, if it were known in advance that the treatment would work, the experiment would not take place. Moreover, if there turns out to be clear early evidence of a significant positive effect of the drug, the trial is stopped so that the control group and the population at large are not denied the treatment, as in the recent large-scale trial of the cholesterol-reducing drug, Atorvastatin.
Social policy trials take place under a rather different set of conditions. A treatment group that receives, say, a certain financial benefit designed to encourage a change in behaviour tends to be at an obvious advantage over those who receive no such payment. True, the ultimate beneficiary in both sorts of trial may be society at large rather than the individual. Nonetheless, social policy trials do sometimes single out randomly selected individuals for apparently preferential treatment in a way that medical trials in circumstances of clinical equipoise do not. And while there is no real ethical distinction between conferring an advantage on certain randomly selected areas as opposed to other randomly selected areas, the political distinctions are considerable.
New Deal for Lone Parents (NDLP) Phase one prototype

Aim: The NDLP prototype was to test the effectiveness of helping lone parents on Income Support (IS) move into work or towards preparing for work with the aid of personal advisers providing tailored packages of help and advice throughout the duration of the scheme.

Background: There are some 1.8 million lone parents of working age in Great Britain. Almost 1 million are out of work, with most claiming IS. As most lone-parent families live in low-income households and are likely to experience persistent poverty, finding work is the most important route out of poverty.

Methods: The prototype service was launched in summer 1997 in eight areas (Phase 1) and in April 1998 was introduced throughout Britain for all lone parents with new or repeat claims. The final phase was national implementation for all lone parents on IS. Eight Benefits Agency districts were selected to represent different labour market conditions. Lone parents whose youngest child was aged at least five years and three months and who had been claiming IS for at least eight weeks were invited to participate; other lone parents were not contacted but could take part if they came forward. Selection of these groups was based on random allocation into participant and non-participant groups, using digits in the National Insurance numbers. Effectively, lone parents were divided into ten groups of approximately equal size, based on these digits, each of which was a random cross-section of the population. The aim of the evaluation was to identify who took part in the programme and why; what helped lone parents into work; the take-up among those eligible; and how much movement into work could be attributed to the programme (the counterfactual). This enabled comparison of random subgroups who had, or had not yet, been invited to participate.

Findings: Phase 1 had a small but appreciable effect on the rate of movement off IS and into work. After 18 months the number of lone parents on IS was 3.3 per cent lower than it would have been in the absence of the programme. About 20 per cent of jobs gained following participation in NDLP were estimated to be additional to those that would have been gained without the programme. 28 per cent of lone parents who participated in NDLP and then started work said that their personal adviser had given them significant help in achieving this. Two out of three participants said that they had benefited from the programme.

Lessons learned: The evaluation reported on short-term outcomes as each stage of implementation was rolled out in quick succession. Findings confirmed much previous research about the personal impact of lone parenthood and the financial insecurity associated with it. NDLP helped those who were more work-ready and those who did not need help with issues like self-confidence, careers guidance, job-search skills, other training and work experience.

Contact details/Further information:
Prototype evaluation, Jane Sweeting, Department for Work and Pensions, Tel: 0207 962 8657
National evaluation, Rebecca Hutten, Department for Work and Pensions, Tel: 0114 259 6259

Hales, J., Lessof, C., Roth, W., Shaw, A., Millar, J. and Barnes, M. (2000), Evaluation of the New Deal for Lone Parents: Early Lessons from the Phase One Prototype - Synthesis Report, Research Report 108, London: Department of Social Security.
Hasluck, C., McKnight, A. and Elias, P. (2000), Evaluation of the New Deal for Lone Parents: Early Lessons from the Phase One Prototype - Cost-Benefit and Econometric Analyses, DSS Research Report 110.
Evaluation of the New Deal for Lone Parents: A Comparative Analysis of the Local Study Areas, DSS In-House Research Report 63.
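The NDLP case study allocated lone parents to ten groups using digits in their National Insurance numbers. A rough sketch of that kind of rule is given below. The case study does not say which digits were used, so taking the last numeric digit is an assumption made here, as are the function names, the made-up identifiers and the choice of which groups are invited first.

```python
def allocation_group(ni_number):
    """Return a group from 0 to 9 based on the last numeric digit of the identifier.

    Assumption: the last numeric digit is used; the actual digits used in the
    NDLP prototype are not stated in the case study.
    """
    digits = [ch for ch in ni_number if ch.isdigit()]
    if not digits:
        raise ValueError("no digits found in identifier")
    return int(digits[-1])

def is_invited(ni_number, invited_groups):
    """True if the person's group is among those invited to participate."""
    return allocation_group(ni_number) in invited_groups

# Made-up identifiers, for illustration only.
sample = ["QQ123456C", "QQ123450A", "QQ987653B"]
invited_groups = {0, 1, 2, 3, 4}  # hypothetical: five of the ten groups invited first

for ni in sample:
    print(ni, allocation_group(ni), is_invited(ni, invited_groups))
```

A rule of this kind needs no separate randomisation step in local offices, provided the digits used are unrelated to people's circumstances, which is what allows each group to be treated as a random cross-section.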
One perennial difficulty with RCTs in the social arena is that they depend critically on the principle of all other things being equal (ceteris paribus), a condition that is very difficult to achieve in reality. For instance, in the US GAIN Programme (Riccio et al., 1994), where random allocation to an experimental and control group was attempted, only around half of the experimental group turned out to have received the treatment, while around the same proportion of the control group turned out to have received one or more elements of what the programme was delivering. Such contamination effects are common and demonstrate the real difficulty of obtaining a straightforward counterfactual. The fact is that in many of the areas in which policy trials tend to take place, whether in the US or Britain, several trials aimed at different but overlapping groups of people may be in progress at once. The possible contaminating effect of this on each of the trials is considerable and although it is in principle possible to eliminate these overlaps, it is tricky in practice to do so.
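A stylised calculation shows why contamination of this kind is so damaging. The sketch below assumes, purely for illustration, that everyone who actually receives the treatment gains the same amount and that receipt is all-or-nothing; in GAIN the control group received only "one or more elements" of the programme, so the real picture is less clear-cut. The function and parameter names are hypothetical.

```python
def observed_contrast(true_effect, share_treated_in_experimental, share_treated_in_control):
    """Expected gap in mean outcomes between the groups as assigned.

    Assumes a constant effect for everyone who actually receives the treatment.
    """
    return (share_treated_in_experimental - share_treated_in_control) * true_effect

# Hypothetical true effect of 10 units per treated person.
print(observed_contrast(10.0, 1.0, 0.0))  # clean trial: everyone assigned is treated
print(observed_contrast(10.0, 0.5, 0.0))  # half the experimental group untreated
print(observed_contrast(10.0, 0.5, 0.5))  # and half the control group treated as well
```

Under these assumptions the gap between the two groups shrinks in proportion to the difference in the shares actually treated, and it vanishes when those shares are equal.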
Moreover, government programmes, whether at their pilot stages or after their full-scale implementation, do not stand still in form or content. They adapt and adjust, often in small ways, to take account of both emerging evidence and changing circumstances. It would be a little naïve of those in charge of evaluations to expect such policy or administrative adjustments to be held back simply for the sake of the integrity of a pilot. So, to take account of the fact that pilots do not exist in a neutral social and economic environment, their design needs to be as robust as possible. Large sample sizes, whether of areas or individuals, help greatly in this respect.
Although we do not hold with the view that RCTs of individuals are the be-all and end-all of piloting methodology, we do believe that they continue to be seriously under-used in Britain in circumstances where their technical advantages would seem to outweigh their other potential difficulties.
Quasi-experimental methods are the usual alternatives to RCTs for impact pilots of new social policy initiatives. They include not only before-and-after studies, but also various types of matched-comparison methods where either areas or individuals, or both, are matched for their characteristics (rather than being selected at random) and then given different treatments. The Family Mediation Pilot (Davis, 2000), the UK Total Purchasing Pilot (Mays et al., 1997), the Chance Pilot (St. James-Roberts and Singh, 2001), the ETU scheme (Marsh, 2001) (Case study 6, p.26) and the EMA pilot (Ashworth et al., 2002; Heaver et al., 2002; Legard et al., 2001; and Maguire and Maguire, 2003) (Case study 5, p.22) have all used quasi-experimental methods of one sort or another.
Quasi-experimental methods vary considerably in the extent to which they approach the precision of random assignment. Some are extremely sophisticated in their matching of treatment and non-treatment groups, using techniques such as propensity score matching to ensure that the treatment and quasi-control group are similar in more respects than, say, their demographic characteristics and economic circumstances. For instance, in the NDLP evaluation (Hales et al., 2000) (Case study 4, p.18), the treatment and non-treatment groups were also matched on their attitudes and behaviour prior to their participation. Similarly, in the evaluation of Employment Zones, a mandatory programme, use was made of ward-level unemployment rates and indices of deprivation, as well as of population profiles, to derive suitable comparison areas.
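The matching step in propensity score matching can be illustrated with a short sketch. It assumes that a score between 0 and 1 has already been estimated for each person (for example by a logistic regression on their characteristics, which is omitted here), and it uses simple greedy nearest-neighbour matching without replacement. This is one common variant and not necessarily the one used in the evaluations cited; the names, the scores and the caliper of 0.05 are hypothetical.

```python
def match_nearest(treated_scores, comparison_scores, caliper=0.05):
    """Pair each treated unit with the closest unused comparison unit by score.

    Both arguments map an identifier to a pre-computed propensity score.
    Treated units with no comparison unit within the caliper are left unmatched.
    """
    available = dict(comparison_scores)
    pairs = {}
    # Work through treated units in order of score so the result is deterministic.
    for t_id, t_score in sorted(treated_scores.items(), key=lambda kv: kv[1]):
        if not available:
            break
        c_id = min(available, key=lambda c: abs(available[c] - t_score))
        if abs(available[c_id] - t_score) <= caliper:
            pairs[t_id] = c_id
            del available[c_id]  # without replacement: each comparison unit used once
    return pairs

# Hypothetical scores.
treated = {"t1": 0.30, "t2": 0.70, "t3": 0.95}
comparison = {"c1": 0.32, "c2": 0.69, "c3": 0.10}
print(match_nearest(treated, comparison))
```

In this example the third treated unit has no comparison unit with a similar score and is dropped, which illustrates a practical cost of matching: the impact estimate then applies only to those for whom a credible match exists.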
Meanwhile, the Jobseeker's Allowance evaluation (Rayner et al., 2000; Fielding and Bell, 2002) employed a before-and-after design incorporating a difference-in-differences method as a quasi-experimental approach to the measurement of impact. Unfortunately, however, changes in the national economy undermined the pilot design, an occupational hazard deriving from the fact that pilots take place in real time. But the New Deal for Young People pilot (Hasluck, 2000) used the same method with a more plausible outcome. A before-and-after design was also used for the Working Families Tax Credit evaluation (McKay, 2001), another example of a policy which has repeatedly changed its form with regrettably little consideration for the researchers involved in its evaluation!
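The arithmetic of a difference-in-differences estimate is simple enough to set out directly. The figures below are invented for illustration and are not taken from any of the evaluations mentioned; the function and argument names are likewise hypothetical.

```python
def difference_in_differences(pilot_before, pilot_after, comparison_before, comparison_after):
    """Change in the pilot group minus the change in the comparison group."""
    return (pilot_after - pilot_before) - (comparison_after - comparison_before)

# Invented example: percentage of claimants in work before and after a pilot.
impact = difference_in_differences(
    pilot_before=40, pilot_after=46,            # pilot areas rise by 6 points
    comparison_before=41, comparison_after=44,  # comparison areas rise by 3 points
)
print(impact)
```

The estimate attributes to the policy only the part of the pilot group's change that exceeds the change in the comparison group. It therefore relies on the two groups following the same trend in the absence of the policy, which is the assumption that breaks down when wider economic changes affect them unequally.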
A less rigorous but occasionally helpful method of impact evaluation is a goals-based one, where the aim is simply to assess whether the intended goals of a policy, programme or project have been achieved by a certain date. The obvious problem with this approach is that it tells us nothing about the counterfactual, that is, whether the desired goals would have been achieved anyway. It also seldom reveals much about any unintended effects of the new policy.
Many of the pilots and evaluations we have referred to have also made some use of qualitative methods, often alongside quantitative ones such as social surveys. In particular, focus groups and depth interviews are often components of summative as well as of formative evaluations, sometimes with the limited role of helping to develop the methods or buttress the findings. But in order to understand or explain the dynamics of a policy intervention or its uneven effects, numerous other techniques are sometimes deployed in formative evaluations, among them deliberative polling, citizens' juries, ethnographic research, participant and non-participant observation, operational analysis and documentary searches and analysis. Our view is that insufficient use is made of combined methodologies in pilot evaluations, which can provide insights that are inaccessible to any single method.
A worrying feature of our enquiries was that the potentially instructive experience of completed pilots was rarely drawn upon outside the department concerned (or sometimes even within it). One way of mitigating this problem is to create an easily accessible library or electronic repository of the wide range of policy pilots that have been, or are being, carried out, with sufficient detail of their origins, methods and outcomes to allow others to learn from their experience.