Why We Should be Wary of Single-Center Trials

Rinaldo Bellomo, MD, FRACP, FJFICM; Stephen J. Warrillow, MBBS, FRACP, FJFICM; Michael C. Reade, MBBS, MPH, DPhil, FANZCA, FJFICM


Crit Care Med. 2009;37(12):3114-9. 


Abstract

Objectives: To highlight the limitations of single-center trials in critical care, using prominent examples from the recent literature; to explore possible reasons for discrepancies between these studies and subsequent multicenter effectiveness trials; and to suggest how the evidence from single-center trials might be used more appropriately in clinical practice.
Study Selection: Topical and illustrative examples of the concepts discussed, including trials of patient positioning, the use of steroids for acute respiratory distress syndrome, the dose of hemofiltration, the control of glycemia, and the targets of resuscitation in sepsis.
Data Synthesis: Many positive single-center trials have been contradicted when tested in other settings and, in one case, the subsequent definitive multicentered trial found a previously recommended intervention associated with active harm. Problems inherent in the nature of single-center studies make recommendations based on their results ill-advised. Single-center studies frequently lack either the scientific rigor or the external validity required to support widespread changes in practice, and their premature incorporation into guidelines may make the conduct of definitive studies more difficult.
Conclusions: We recommend that practice guidelines should rarely, if ever, be based on evidence from single-center trials. Physicians should apply the findings of single-center trials only after careful evaluation of their methodology, and in particular after comparing the context of the trial with their own situation.

Introduction

Timeo Danaos et dona ferentes. (I fear the Greeks even when bearing gifts.)
Virgil, The Aeneid

In Greek mythology, according to Virgil in The Aeneid, so spoke the Trojan priest Laocoön upon seeing the wooden horse left by the Greeks on the beach near Troy. The goddess Minerva, friendly to the Greeks, sent a serpent to strangle Laocoön and his two sons, Antiphantes and Thymbraeus. His warning went unheeded and Troy was destroyed. This quotation is now used to advise caution when faced with a poorly understood gift that might be, as it were, a Trojan horse. While hoping to escape the fate of Laocoön and his sons, here we argue that positive single-center trials, particularly in critical care, should similarly be treated cautiously. Highlighting a selection of recent examples, we argue that when poorly understood, single-center studies incorporated into guidelines may prove inimical to truth. They should form the preliminary evidence needed to justify further investigations, rather than trigger widespread changes in practice.

Attraction of Single-center Trials

Single-center studies are often an essential starting point for testing interventions in critical care, allowing larger, multicenter studies to be planned and powered appropriately. They have several advantages over multicenter trials: they are logistically easier; they are cheaper; they do not require prolonged negotiations on the study protocol; they simplify data collection; and they typically deal with a less heterogeneous population, thereby diminishing confounding. Studies of semirecumbent patient positioning,[1] the use of steroids for acute respiratory distress syndrome,[2] the intensity of renal replacement,[3,4] the control of glycemia,[5] and the targets of resuscitation in sepsis[6] form a nonexhaustive list of studies given prominence both because they were all "positive" and because physicians are naturally biased toward believing that effective therapies can be identified for their patients.

Imperfect-but Better Than No Evidence?

Much of the critical care evidence base is composed of single-center trials, and in the absence of better information, such trials have been incorporated into clinical guidelines with the argument that they represent "best available evidence." Some argue this is preferable, if the alternative is practice based on opinion and experience. There is evidence that bundled-care processes, many of which implement the results of single-center trials, improve outcome when broadly applied.[7] However, protocolizing care is itself beneficial,[8,9] so such studies of subsequent broad application do not necessarily support the conclusions of the underlying single-center trials. Patients may be better served by protocols specific to individual units. We contend that care based on the opinion of competent practitioners, tailored to their specific circumstances, may be superior to care delivered by inappropriately applying the results of single-center trials. For example, tight control of glycemia in critical illness[5] might be effective when employed with the same rigor as in the original trial, but safety and efficacy might depend on personnel unavailable in another hospital.[10] This theoretical concern was validated by the recently published NICE-SUGAR (Normoglycaemia in Intensive Care Evaluation-Survival Using Glucose Algorithm Regulation) multicentered effectiveness trial of intensive glycemic control,[11] which was not only unable to replicate the benefit found in the earlier trial but also found that, when broadly applied, the intervention did significant harm. Sometimes, therefore, blindly implementing imperfect "best available evidence" can be actively detrimental to patients.

Some argue that the development of evidence-grading systems should be sufficient to defend against the inappropriate application of the results of single-center trials. We have previously highlighted our concerns with current classification systems.[12] Guidelines are often interpreted as the minimum acceptable standard of care. Despite the graded quality of their underlying evidence, and even despite their own advice to the contrary, guidelines have been used as pay-for-performance measures[13] and in litigation.[14] Physicians may act against their better judgment on the basis of guideline recommendations. For example, 86% of Chinese intensivists would use central venous pressure to guide resuscitation because of its recommendation by the Surviving Sepsis Campaign (SSC), despite this recommendation being based on low-grade (C) evidence and despite only 47% believing in its utility.[15] Qualifying the strength of a recommendation therefore seems to do little to temper its subsequent widespread implementation.

Problems with Single-center Trials

When widely publicized or incorporated into guidelines, the results of single-center trials risk being broadly implemented. Unfortunately, we consider that there is theoretical and empirical evidence to suggest their results are likely to be seriously flawed.

Limited External Validity

The primary and most obvious shortcoming of single-center studies is their potentially limited external validity. Interventions tested in a single clinical environment are not necessarily generalizable to a broader population, and this may be particularly true in critical care. What constitutes "intensive care" varies widely between and even within countries.[16] It can be determined by resources (manifest, for example, by nurse/patient ratios and the predicted mortality thresholds that warrant intensive care unit admission), case-mix (which varies with the age profile and comorbidities of the population served), and the culture of end-of-life care. Such factors influence both baseline risk and the care received by the control group, two factors that frequently interact with an intervention's efficacy and that are commonly much more important determinants of outcome than the intervention itself.

Experience in critical care research suggests that the baseline risk of patients to whom an intervention is applied may be particularly important. The prevalence of the outcome measure in the control population is an (albeit imperfect) estimate of baseline risk. In a single-center, randomized, controlled study of ventilator-associated pneumonia,[1] placing patients in a semirecumbent position (45° vs. 0°) was found to decrease the rate of ventilator-associated pneumonia from 34% to 8%. This seemingly impressive finding has subsequently been incorporated into clinical practice guidelines.[17,18] However, a more recent multicenter, randomized, controlled trial of a similar approach[19] found no difference, perhaps because the frequency of ventilator-associated pneumonia in the control group (supine patients) was only 6.5% (Fig. 1), closer to the 9.3% found in a large database study.[20] The intervention may have been effective only in a population at atypically high risk. Admittedly, there are other possible explanations for the discrepancy. The subsequent multicenter trial aimed for 45° vs. 10°, a lesser difference; although not noted in the paper, this design was perhaps chosen because, by the time of the later study, it was thought unethical to place patients supine, an example of how premature incorporation into guidelines can make more rigorous testing difficult. Additionally, in practice, the investigators achieved much less than the targeted difference (23° vs. 14° by day 5), highlighting the need to confirm the more general feasibility of interventions pioneered by protagonists in single-center trials. Of note, the original positive trial never reported compliance data.

Figure 1.

Variability in the rates of ventilator-associated pneumonia (VAP) in recent studies of semirecumbent positioning.
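
To make the arithmetic of baseline risk concrete, consider a minimal sketch (the event rates are those quoted above; the assumption that the relative effect is preserved at a lower baseline is ours, and is generous):

```python
# Sketch: the single-center semirecumbency effect (VAP 34% -> 8%) applied,
# purely for illustration, to the multicenter control rate of 6.5%.

def rrr(control, treated):
    """Relative risk reduction from two event rates."""
    return (control - treated) / control

single_center_rrr = rrr(0.34, 0.08)                      # ~76%

# Assume, generously, that the same relative effect holds at the
# lower multicenter baseline.
multi_control = 0.065
multi_treated = multi_control * (1 - single_center_rrr)  # ~1.5%

arr_single = 0.34 - 0.08                   # absolute reduction: 26 points
arr_multi = multi_control - multi_treated  # absolute reduction: ~5 points

print(f"RRR: {single_center_rrr:.0%}")
print(f"ARR: {arr_single:.1%} (single-center) vs {arr_multi:.1%} (multicenter)")
print(f"NNT: {1 / arr_single:.0f} vs {1 / arr_multi:.0f}")
```

Even under this generous assumption, the absolute benefit shrinks roughly fivefold at the lower baseline, and a trial of the same size has correspondingly less power to detect it.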

Concerns about external validity[21] have also been expressed in relation to the study of early goal-directed therapy (EGDT) for the treatment of severe sepsis:[6] Mortality in the control group of that study was approximately 70% higher than in several studies of similar patients in Australia, New Zealand,[22–24] and a U.S. academic medical center[25] (Fig. 2). When interventions are likely to have few adverse effects and cost little, as is the case with semirecumbent positioning, the harm done by ineffectively applying them to a low-risk population is small. However, where the intervention is potentially expensive (such as EGDT) or has potentially serious side effects (such as intensive glycemic control), it is wise to perform multicenter trials in the population of interest, to be sure that any potential benefit outweighs the harm. These considerations are appreciated in the lay press,[26] if not yet widely within our specialty.

Figure 2.

Variability in hospital mortality rates in recent studies of patients with severe sepsis presenting to the Emergency Department. EGDT, early goal-directed therapy.

Implausible Effect Size

It is almost axiomatic that interventions subjected to large-scale, multicenter, clinical trials will show at best a modest effect size. If this were not the case, there would be little need to conduct such a trial. Funnel-web spider antivenom was licensed for use in humans on the basis of experience in two patients,[27] because its effect was dramatic and obvious. The commonest treatments in intensive care medicine (such as vasopressor use for hypotension, or epinephrine for anaphylaxis) have never been subjected to placebo-controlled trials because their effect size is large and immediately apparent. In contrast, intravenous thrombolysis for acute myocardial infarction is accepted as highly effective, but the initial multicenter trials of streptokinase found comparatively modest reductions in 21-day mortality: GISSI (Gruppo Italiano per lo Studio della Streptochinasi nell'Infarto Miocardico)[28] found streptokinase reduced mortality from 13% to 10.7% (a relative risk reduction [RRR] of 18%), and ISAM (Intravenous Streptokinase in Acute Myocardial Infarction)[29] found a nonsignificant reduction from 7.1% to 6.3% (RRR = 11%). Similarly, the initial multicenter trial of activated protein C reduced 28-day mortality of patients with severe sepsis from 30.8% to 24.7% (RRR = 19.4%).[30]
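
These relative risk reductions follow directly from the quoted mortality rates; as a quick check (a minimal sketch using only the figures above):

```python
# Crude relative risk reduction, RRR = (control - treated) / control.
# Small differences from the published figures can reflect the trials'
# own adjusted analyses.
trials = {
    "GISSI (streptokinase)": (0.130, 0.107),
    "ISAM (streptokinase)": (0.071, 0.063),
    "Activated protein C in severe sepsis": (0.308, 0.247),
}
for name, (control, treated) in trials.items():
    print(f"{name}: RRR = {(control - treated) / control:.0%}")
```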

These precedents in the literature cast some doubt over the plausibility of reported effect sizes in many recent single-center studies. EGDT in severe sepsis reduced mortality from 46.9% to 30.5% (RRR = 35%).[6] Intensive insulin therapy reduced intensive care unit mortality in a largely cardiac surgical population from 8.0% to 4.6% (RRR = 42.5%).[5] Comparisons of high- vs. low-dose renal replacement therapy for acute renal failure have reported reductions in mortality from 57% to 41% (RRR = 28%)[3] and 46% to 28% (RRR = 39%).[4] Such large effect sizes are, not surprisingly, rarely replicated in multicenter effectiveness trials: For example, dialysis dose had no effect on 60-day mortality in a 1124-patient trial, where mortality in control and intervention groups was 51.5% and 53.6%.[31] Intensive insulin therapy was ineffective in the same institution in a different population with a higher baseline risk (control 28-day mortality, 26%)[32] and in multicenter studies.[11,33,34] EGDT is currently the subject of three multicenter trials: Protocolized Care for Early Septic Shock (ProCESS), http://clinicaltrials.gov/ct2/show/NCT00510835; Australasian Resuscitation in Sepsis Evaluation (ARISE) (Australian New Zealand Clinical Trials Registry ACTRN12608000053325); and Protocolised Management in Sepsis (ProMISe) (trial funded but not yet registered or recruiting patients).
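
One way to see why small single-center trials tend to report large effects is to ask how many patients are needed to detect effects of different sizes. The sketch below uses the standard normal-approximation sample-size formula for two proportions; the rates are those quoted above, and the choice of alpha = 0.05 with 80% power is a conventional assumption:

```python
# Approximate per-arm sample size for a two-sided comparison of two
# proportions (normal approximation, alpha = 0.05, power = 0.80).
from statistics import NormalDist

def n_per_arm(p_control, p_treated, alpha=0.05, power=0.80):
    z = NormalDist()
    z_alpha, z_beta = z.inv_cdf(1 - alpha / 2), z.inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treated * (1 - p_treated)
    return (z_alpha + z_beta) ** 2 * variance / (p_control - p_treated) ** 2

# EGDT-sized effect (46.9% -> 30.5% mortality): detectable in one center.
print(f"EGDT-sized effect:  ~{n_per_arm(0.469, 0.305):.0f} patients per arm")

# GISSI-sized effect (13% -> 10.7% mortality): needs thousands of patients.
print(f"GISSI-sized effect: ~{n_per_arm(0.130, 0.107):.0f} patients per arm")
```

A single center can realistically randomize at most a few hundred patients, so only effects of implausible magnitude can reach statistical significance there; detecting a GISSI-sized effect requires roughly 3000 patients per arm and hence multiple centers.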

Unequal Allocation of Resources

Other limitations of single-center studies may be less apparent. Such studies[3,5,6] are frequently performed by a protagonist with highly atypical expertise and commitment. If delivery of a complex intervention relies on such dedication, it may be impossible to implement elsewhere. The delivery of many interventions requires additional staff time or expertise, a difference that can alter outcome independently of the treatment effect. For example, in intensive care units that routinely allocate one nurse to two patients, a patient randomized to the active intervention (for example, intensive glycemic control)[5] may receive more nursing attention than patients allocated to standard care. Furthermore, if the unit's nursing resources are not increased, control patients may receive even less attention than usual.

Lack of Blinding

Investigators in many of the positive single-center studies embraced by the intensive care community have not been blinded to treatment allocation.[1,3–6,35] This lack of blinding was in most cases inevitable: For example, it is almost impossible to blind investigators to intensive insulin therapy, semirecumbent patient positioning, sedation cessation, high-dose hemofiltration, or ScvO2-guided resuscitation. Nonetheless, knowledge of treatment allocation creates several problems, especially when the investigator wears the dual hats of scientist and clinician. To suspect bias in patient care is not to imply that investigators consciously treat some patients better than others so as to obtain the desired result. All clinical staff know the goal of the study and may (consciously or unconsciously) try to "please" the investigator. This Hawthorne effect can be extremely powerful in changing behavior[36,37] and patient outcome.[38] A reverse Hawthorne effect may also play a role if clinicians alter their approach to the care of patients not randomized to the active intervention, or if nurses have less time to spend on such patients.

In single-center studies, knowledge of recent patients' treatment allocations makes it difficult to conceal the likely allocation of subsequent patients. Most studies use block randomization to ensure relatively even numbers in each group. If an investigator knows the block size, it is often possible to be sure, or nearly sure, of the next treatment allocation: If the block size is six and the last three patients have been randomized to treatment A, the next three will all be allocated to treatment B. Varying the block size makes the guessing harder but, to a degree, the effect persists. Patient recruitment can then become consciously or unconsciously modulated by knowledge of the treatment the next patient is most likely to receive.
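
How predictable is a fixed block? A minimal simulation (the block size of six follows the example above; the code counts allocations that are fully determined by the preceding allocations in the same block):

```python
import random

def permuted_block(block_size=6):
    """One randomly permuted block with equal allocations to A and B."""
    block = ["A", "B"] * (block_size // 2)
    random.shuffle(block)
    return block

random.seed(1)
certain = total = 0
for _ in range(10_000):
    a_seen = b_seen = 0
    for allocation in permuted_block():
        # Once one arm has used its quota of 3, the rest of the block
        # is known with certainty before the next patient is enrolled.
        if a_seen == 3 or b_seen == 3:
            certain += 1
        total += 1
        a_seen += allocation == "A"
        b_seen += allocation == "B"

print(f"Allocations predictable with certainty: {certain / total:.0%}")
# With blocks of six, about 25% of allocations (including every sixth
# patient) are fully determined before the patient is randomized.
```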

In unblinded single-center studies, there is, in effect, an ongoing interim analysis. Unless a robust independent data safety monitoring committee ensures that a predetermined number of patients is randomized and that data analysis is performed according to a prespecified plan, this unaccounted-for "alpha spending"[39] substantially increases the risk of falsely finding a positive treatment effect. Such supervision is not typical of single-center studies. None of the prominent single-center trials discussed here published its statistical analysis plan, in print or electronic form, before its conclusion. As a consequence, it is impossible to know whether a particular outcome measure was chosen a priori or identified later, when the investigator had knowledge of its comparative features. The requirement for prospective registration of clinical trials should alleviate this problem in the future, although many as yet unreported trials were registered only after they completed enrollment. Retrospective registration is, for the moment, considered acceptable by many journals, so this difficulty is likely to persist for some time.
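
The effect of uncorrected repeated looks on the false-positive rate is easy to demonstrate by simulation. In this minimal sketch the null hypothesis is true by construction, and the ten equally spaced looks are an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(2009)
n_sims, n_max = 2000, 500
looks = range(50, n_max + 1, 50)      # ten uncorrected interim analyses

false_positives = 0
for _ in range(n_sims):
    # Both arms are drawn from the same distribution: no true effect.
    a = rng.normal(size=n_max)
    b = rng.normal(size=n_max)
    for n in looks:
        diff = a[:n].mean() - b[:n].mean()
        se = np.sqrt(a[:n].var(ddof=1) / n + b[:n].var(ddof=1) / n)
        if abs(diff / se) > 1.96:     # nominal two-sided p < 0.05
            false_positives += 1
            break                     # trial "stopped early for benefit"

print(f"False-positive rate over 10 looks: {false_positives / n_sims:.2f}")
# Expect roughly 0.19, nearly four times the nominal 0.05 of a single
# analysis performed only at the prespecified end of the trial.
```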

Examples of Single-center Trials That Have Been Subsequently Contradicted

Given the above difficulties, one might expect to find numerous examples of interventions found to be effective in single-center studies that have subsequently been found ineffective or harmful on further testing. Examples do exist in critical care: A 1998 trial of corticosteroids in 24 patients with late acute respiratory distress syndrome found a significant treatment benefit[2] that was not supported in a later definitive multicenter trial;[40] a 1993 trial of 67 patients concluded supranormal oxygen delivery was beneficial in critical illness,[41] whereas later, larger trials found no benefit.[42,43] The 2001 trial of intensive insulin therapy[5] has not been replicated by either the VISEP (Efficacy of Volume Substitution and Insulin Therapy in Severe Sepsis)[33] or Glucontrol[34] studies, and a recent meta-analysis[10] found no effect on hospital mortality but a significant increase in the risk of hypoglycemia. Hypoglycemia might explain the increased mortality observed in NICE-SUGAR.[11] In a single-center study, Schiffl et al found that increasing the dialysis dose increased survival in patients with acute renal failure.[4] Similarly, in another study of acute renal failure involving 425 patients, Ronco et al found that increasing the dose of hemofiltration increased survival.[3] Both studies, however, have been contradicted by the findings of the recent multicenter Veterans Affairs/National Institutes of Health dose of dialysis trial.[31] This list grows many-fold if one also considers single-center studies showing improvement in surrogate end points that have been contradicted by subsequent multicentered trials.[44]

Possibly of even greater concern than the premature adoption of novel treatments is the observation that, once incorporated into guidelines, positive single-center studies using mortality end points can be more difficult to retest in large, multicenter, effectiveness trials. For example, the confirmatory multicentered trial of head-up positioning during mechanical ventilation did not place control patients supine, possibly because by then equipoise was perceived to be lost.[19] A number of centers were reluctant to join the ProCESS trial of EGDT because clinicians felt it had already become a standard of care in their institutions (D. Angus, personal communication). The CORTICUS (Corticosteroid Therapy of Septic Shock) authors suspected enrollment in their trial of steroids in septic shock was low because clinicians were reluctant to deviate from guidelines in force at the time of the trial.[45] That so many confirmatory trials have nonetheless been performed argues that these difficulties can be overcome, but only at greater expense or delay.

Merits of Single-center Trials

Notwithstanding the above arguments, single-center studies must be encouraged. They allow better planning of definitive multicenter trials—e.g., the single-center study of sedation interruption[35] that formed the basis (with a modified protocol involving spontaneous breathing trials) of the positive Awakening and Breathing Controlled trial.[46] They provide essential training in research methodology for junior investigators, and promote a culture of investigator-driven research within our specialty. In this manuscript, we have highlighted a number of examples of single-center trials with substantial shortcomings. Although this is not an exhaustive review and we accept the possibility that we have overlooked definitive evidence based on results from a single center, we contend the examples we describe are sufficient to argue against the incorporation of all but the most exceptional single-center trial into international guidelines. The meta-analysis of a number of such studies, even if performed using individual patient data, does not overcome all these limitations. If the notion that "the plural of anecdote is not data" implies that the plural of a case report is not evidence, then we suggest by analogy that the plural of single-center study is not necessarily robust truth.

An Alternative to the Premature Adoption of the Recommendations of Single-center Trials

Multicenter trials are difficult to conduct, and when underpowered or poorly conducted may be even less useful than a single-center trial. In the absence of robust multicenter trial evidence, should we entirely abandon reference to the literature? Certainly not. Quality improvement programs based on protocols drawn from single-center trials are known to work,[47] but these must be relevant to the local environment. What works in one location may not work in others[48] because of the above-listed differences in baseline risk, intercurrent care, opportunity cost, and the fidelity with which the intervention can be delivered. Audit of local outcomes is therefore an essential step in assessing the applicability of single-center trials: If local outcomes are already superior to those in a single-center trial,[22] the results of that trial may not apply. Faith in local audit must nevertheless be tempered by an appreciation of its limitations. Audit data can be influenced by secular trends in intercurrent care and by the problems inherent in unblinded research, but they have the great advantage of being inherently applicable to local conditions. When an intervention is adopted on the basis of poor-quality or contextually less relevant data, local audit of the outcomes that follow is imperfect but nonetheless essential.
