The Primary Endpoint Is Positive: What More Do You Need?

Gregg W. Stone, MD


February 06, 2017


Editor's Note: In this slide lecture, Gregg W. Stone, MD, outlines key considerations before accepting a clinical trial as "positive." See the companion presentation by Stuart Pocock, PhD, The Primary Endpoint Is Negative: What Happens Now?


Hello. I am Gregg Stone from Columbia University Medical Center and the Cardiovascular Research Foundation in New York. Today I am going to talk about clinical trial interpretation. Specifically, what should be done when the primary endpoint of a clinical trial is positive? Is that enough evidence to either adopt a new therapy or to change clinical practice? This is a synthesis of one part of a two-part article[1,2] that Stuart Pocock and I recently wrote and had published in the New England Journal of Medicine on how to interpret clinical trials.

There is a natural tendency to assess the results of randomized clinical trials as either positive or negative according to whether the P value for the primary outcome measure is < .05 or > .05. However, such an interpretation is overly simplistic. The primary endpoint result is just a starting point in the comprehensive evaluation of the totality of the clinical evidence, which includes consideration of secondary endpoints, safety issues, and the size and quality of the trial.


I am going to talk about how to assess a clinical study for which the primary outcome is positive, a P value < .05. That is good news, and is necessary for a clinical trial to be considered positive in most cases. However, it is really not sufficient. We suggest asking 11 key questions when the primary outcome is positive to know whether it is positive enough. It's a long list; I won't read it. Instead I will give you examples that we used to illustrate these points in the New England Journal of Medicine article.[1]


First, is a P value < .05 strong enough evidence to adopt a new therapy? You have to realize that a P of .05 represents about a 5% risk for a false-positive result. Proof beyond a reasonable doubt requires a much smaller P value (eg, < .001). A P value of .01 to .049 is some evidence that the trial is positive, but it is not really conclusive.
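To make the false-positive risk concrete, here is a small, purely illustrative Python simulation (not from the article; all parameters are invented): it repeatedly runs a two-arm trial in which the treatment truly does nothing and counts how often a two-proportion test nonetheless comes out "positive" at each threshold.

```python
import math
import random

random.seed(1)

def two_sided_p(e1, n1, e2, n2):
    # Two-proportion z-test, two-sided P value via the normal tail.
    p1, p2 = e1 / n1, e2 / n2
    pooled = (e1 + e2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0
    return math.erfc(abs(p1 - p2) / se / math.sqrt(2))

def null_trial(n_per_arm=2000, event_rate=0.10):
    # Both arms share the same true event rate: any "signal" is noise.
    e_ctrl = sum(random.random() < event_rate for _ in range(n_per_arm))
    e_trt = sum(random.random() < event_rate for _ in range(n_per_arm))
    return two_sided_p(e_trt, n_per_arm, e_ctrl, n_per_arm)

pvals = [null_trial() for _ in range(2000)]
frac_05 = sum(p < .05 for p in pvals) / len(pvals)
frac_001 = sum(p < .001 for p in pvals) / len(pvals)
print(f"'positive' at P < .05:  {frac_05:.3f}")   # close to 0.05
print(f"'positive' at P < .001: {frac_001:.3f}")  # close to 0.001
```

Raising the bar to P < .001 cuts the false-positive rate roughly fifty-fold, which is why a single trial with P just under .05 is suggestive rather than conclusive.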

For example, take the PARADIGM-HF trial[3] of a new neprilysin inhibitor and angiotensin receptor blocker versus enalapril in more than 8000 patients with heart failure (HF). Looking at the primary endpoint of cardiovascular (CV) death or HF hospitalization, the new agent reduced that endpoint by 20% compared with enalapril. The P value was < .001, and there were also individual reductions in all-cause death, CV death, and HF hospitalizations, each of which also had P values < .001. This is overwhelming evidence of benefit. Therefore, regulatory approval, both in the United States and Europe, came from this single randomized trial.


In contrast, consider the SAINT-1 trial[4] of a free radical–trapping agent versus placebo in about 1700 patients with acute ischemic stroke. The primary endpoint was the modified Rankin score (mRS) for disability at 90 days. If you look at the overall distribution, there was some evidence of benefit. The P value was .038 by the Cochran-Mantel-Haenszel test, corresponding to roughly 20% better odds of an improved disability score. This, in and of itself, with a P value just barely < .05, was not enough for regulatory approval. The regulators said, "We need more data; do a larger trial."


They ran the almost identical SAINT-2 trial[5] in 3306 patients with acute ischemic stroke, using a very similar endpoint. The P value was .33, a negative trial. The conclusion was that this new agent was ineffective for acute ischemic stroke. The regulators were right in saying that there was just not enough evidence and they needed more.


The second question—and this is very important—is whether the magnitude of the treatment benefit is clinically relevant. The more patients you study, the more likely you are to demonstrate a statistically significant benefit if one exists. If the clinical benefit is very small, it may not be worth the cost or potential complications. In this case, it is helpful to consider the absolute treatment effects as well as the relative treatment effects—for example, with a number needed to treat (NNT) analysis.

The now classic example is IMPROVE-IT,[6] ezetimibe versus placebo in nearly 18,000 patients with acute coronary syndrome (ACS), all of whom were treated with simvastatin. The primary endpoint of CV death, myocardial infarction (MI), unstable angina, revascularization, or stroke was reduced by adding ezetimibe to simvastatin. It was a 6% relative reduction, the P value was .02, and it took 7 years for this trial to become positive. Over that 7-year period, there was a 2% absolute difference in event rates (about 0.3% per year). It was fair to ask whether the benefit of ezetimibe is large enough to warrant its cost and potential complications. A US Food and Drug Administration (FDA) advisory panel voted no.
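The NNT arithmetic behind this judgment is simple; here is a sketch in Python using only the approximate figures quoted above.

```python
def nnt(absolute_risk_reduction):
    # Number needed to treat = 1 / absolute risk reduction.
    return 1.0 / absolute_risk_reduction

# IMPROVE-IT, as quoted in the lecture: ~2% absolute reduction
# over 7 years, i.e., roughly 0.3% per year.
print(round(nnt(0.02)))   # patients treated for 7 years to prevent one event
print(round(nnt(0.003)))  # patient-years of treatment per event prevented
```

Roughly 50 patients treated for 7 years (over 300 patient-years of therapy) per event prevented: whether that justifies the cost is exactly the question the FDA panel weighed.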


The third question is whether the primary outcome is clinically important. There are several ways to look at this. First, consider the issue of surrogate endpoints. To make trials smaller, we often use surrogate endpoints that are not clinical endpoints but other mechanistic or biochemical measures that often correlate with an improved clinical outcome. For example, hemoglobin A1c is a measure of diabetic control. The ACCORD trial[7] was a study of intensive versus standard glucose-lowering therapy in more than 10,000 patients with type 2 diabetes. The intensive therapy markedly reduced hemoglobin A1c levels. There was no doubt about this with a P value < .0001.

On the other hand, the primary endpoint of CV death, MI, or stroke was not significantly improved and, in fact, mortality increased. This shows that you cannot always—in fact, often you cannot—rely on surrogate endpoints. We really need clinical evidence of benefit before we adopt new therapies.


You also have to look very carefully at composite clinical endpoints. The reason we use composite endpoints, even when they are clinically based, is to reduce sample size. This assumes that the mechanism of benefit for each of the composite components is the same, and that they will all go in the same direction.

For example, consider the EXPEDITION trial[8] of an intravenous sodium-hydrogen exchange inhibitor, cariporide, in patients undergoing bypass surgery. The primary endpoint of death or MI at 5 days was significantly reduced with this new agent, but you have to look at the components. In fact, mortality was increased while MI was decreased. Obviously, the increase in mortality does not warrant adopting this drug, even though the primary endpoint was reduced.

Question 4 is, are the secondary outcomes supportive? Confidence in the totality of the evidence is enhanced if prespecified secondary outcomes, in addition to the primary endpoint, are positive.


For example, I told you about the SAINT-1 trial of a free radical–trapping agent versus placebo in acute ischemic stroke. The mRS, the primary endpoint, was borderline positive in the first trial, with a P value of .038.

There were other secondary endpoints, such as the National Institutes of Health Stroke Scale (NIHSS) and the Barthel index, both of which were negative. This was an additional reason to require more data rather than approving the therapy on the basis of the trial's primary endpoint alone.

The primary endpoint was mildly positive, but the secondary endpoints were not, raising doubts and reinforcing the need for the second, larger trial, which as we saw earlier was negative.


In contrast, look at the EMPA-REG OUTCOME trial[9] of empagliflozin versus placebo in 7000 patients with type 2 diabetes. The primary endpoint was CV death, MI, or stroke, with a median follow-up of 3 years. There was borderline evidence of efficacy. You can see a reduction of 1.6% with a P value of .04. The other prespecified endpoints (CV death, all-cause mortality, and HF hospitalization) were all strongly reduced in favor of the new agent. These secondary endpoints now lend credence to the utility and benefit of that borderline primary endpoint.


Question 5 is whether the findings are consistent across important subgroups. You have to be very careful when interpreting subgroup data because spurious findings can arise when multiple subgroups are analyzed. Nonetheless, we can learn a lot from looking at subgroups, especially if biologic plausibility is present. A great example of this was the PLATO trial[10] of ticagrelor versus clopidogrel in more than 18,000 patients with ACS.

If you look at the primary endpoint of CV death, MI, or stroke at 12-month follow-up, most subgroups were very positive except for one. The benefit depended on the chronic dose of aspirin that was used. If aspirin was used at a low dose (< 300 mg), the benefit of ticagrelor in terms of reducing ischemic events was very robust. In a relatively small subgroup of patients taking a high dose of aspirin, there was actually harm in using ticagrelor. The interaction between these two relative risks was highly significant, with P well below .0001.


No one has established a plausible biologic mechanism for this interaction, but because it was so strong, the FDA placed a boxed warning when ticagrelor was approved: if you are going to use ticagrelor, you have to use a low dose of chronic aspirin, with < 100 mg recommended.
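The multiple-subgroups caution above is easy to quantify. Assuming independent subgroup tests and no true effect anywhere (an idealization, not data from any trial), the chance of at least one spuriously "significant" subgroup grows quickly:

```python
# P(at least one P < .05 among k independent null tests) = 1 - 0.95**k
for k in (1, 5, 10, 20, 30):
    print(f"{k:2d} subgroups: {1 - 0.95 ** k:.0%} chance of a spurious finding")
```

With 20-30 subgroups, a spurious "positive" subgroup is more likely than not, which is why only interactions with biologic plausibility or, as with PLATO's aspirin-dose finding, overwhelming statistical strength deserve serious attention.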


A very important point is question 6: Is the trial large enough to be convincing? Small trials lack power, and positive treatment effects in small trials are susceptible to exaggeration. False positives—called type I errors—can occur. You need a large trial, usually a multicenter trial, to be convincing. For example, a well-known randomized controlled trial (RCT) of N-acetylcysteine versus placebo to prevent contrast-induced nephropathy was performed in 83 patients with chronic kidney disease (CKD). As published in the New England Journal of Medicine,[11] there was a 90% reduction in contrast nephropathy with N-acetylcysteine. This led to the conclusion, in the journal, that N-acetylcysteine prevents the reduction in renal function induced by iopromide. In retrospect, this is much too strong a claim. A 90% treatment effect is not biologically plausible; almost nothing has such a large treatment effect. Subsequently, a meta-analysis[12] of multiple small to moderately sized RCTs of N-acetylcysteine did not consistently show efficacy. In retrospect (and I think the journal would do this today), given this unreasonably large treatment effect, a better conclusion would have been that N-acetylcysteine may reduce contrast nephropathy, which would have prompted confirmatory trials.
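The exaggeration of effects in small "positive" trials (the so-called winner's curse) can be demonstrated with a toy simulation. All numbers here are invented for illustration: a modest true relative risk of 0.80, a 20% control event rate, and 100 patients per arm.

```python
import math
import random

random.seed(7)

def two_sided_p(e1, n1, e2, n2):
    # Two-proportion z-test, two-sided P value via the normal tail.
    p1, p2 = e1 / n1, e2 / n2
    pooled = (e1 + e2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return 1.0 if se == 0 else math.erfc(abs(p1 - p2) / se / math.sqrt(2))

TRUE_RR, CTRL_RATE, N = 0.80, 0.20, 100  # invented parameters

significant_rrs = []
for _ in range(5000):
    e_ctrl = sum(random.random() < CTRL_RATE for _ in range(N))
    e_trt = sum(random.random() < CTRL_RATE * TRUE_RR for _ in range(N))
    if e_ctrl and two_sided_p(e_trt, N, e_ctrl, N) < .05:
        # Keep only the trials that happened to reach "significance."
        significant_rrs.append((e_trt / N) / (e_ctrl / N))

mean_rr = sum(significant_rrs) / len(significant_rrs)
print(f"true RR: {TRUE_RR}")
print(f"mean observed RR among 'significant' trials: {mean_rr:.2f}")
```

The trials that cross P < .05 are systematically the ones whose random error favored the treatment, so their observed effect substantially overstates the true 20% reduction, just as the 90% effect in the 83-patient N-acetylcysteine trial should have aroused suspicion.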


If the trial is positive, you also have to ask: Was it stopped early? Trials can be stopped early because of slow recruitment, in which case you have to be very careful about selection bias; the population that was studied may not be representative. You also have to be careful if the trial was stopped early because of markedly positive interim results. It is not uncommon to stop on what is called a random high. If, for example, the true effect is a 30% reduction, then with only a small number of patients enrolled, the interim estimate may come in anywhere from a 5% or 20% reduction up to a 50%-60% reduction. Trials are often stopped early because of an apparently marked treatment benefit, and when that happens, the estimate at stopping is usually higher than the actual truth.
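The random high can also be simulated. In this sketch (all parameters invented for illustration), a trial with a true 30% relative reduction is examined at four interim looks and stopped as soon as P < .05; trials stopped at the earliest look report the most inflated effects.

```python
import math
import random
from collections import defaultdict

random.seed(3)

def two_sided_p(e1, n1, e2, n2):
    # Two-proportion z-test, two-sided P value via the normal tail.
    p1, p2 = e1 / n1, e2 / n2
    pooled = (e1 + e2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return 1.0 if se == 0 else math.erfc(abs(p1 - p2) / se / math.sqrt(2))

LOOKS = (200, 400, 800, 1600)  # interim analyses, patients per arm
RATE, TRUE_RR = 0.20, 0.70     # true effect: a 30% relative reduction

def sequential_trial():
    e_trt = e_ctrl = n = 0
    for target in LOOKS:
        while n < target:
            e_ctrl += random.random() < RATE
            e_trt += random.random() < RATE * TRUE_RR
            n += 1
        if e_ctrl and two_sided_p(e_trt, n, e_ctrl, n) < .05:
            break  # the monitoring board stops the trial at this look
    return n, (e_ctrl - e_trt) / e_ctrl  # observed relative reduction

by_stop = defaultdict(list)
for _ in range(3000):
    stop_n, rel_reduction = sequential_trial()
    by_stop[stop_n].append(rel_reduction)

for stop_n in sorted(by_stop):
    vals = by_stop[stop_n]
    print(f"stopped at {stop_n:4d}/arm: mean observed reduction {sum(vals)/len(vals):.0%}")
```

This is why effect estimates from trials stopped early for benefit should generally be read as likely upper bounds on the true effect.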

A good example is the FAME 2 trial,[13] percutaneous coronary intervention (PCI) versus medical therapy in patients with positive fractional flow reserve lesions consistent with ischemia. The primary endpoint was death, MI, or urgent revascularization, and this trial was stopped early by the data and safety monitoring board (DSMB) after only 888 patients were enrolled because of overwhelming evidence of benefit.


Looking at which component actually drove the benefit, it was primarily urgent revascularization—which, you can argue, is a relatively soft outcome in an unblinded trial. In contrast, neither death nor MI was significantly reduced, although there was a trend toward a reduction. Had this trial been allowed to go to its natural completion, the reduction in death or MI might have become significant, which would have considerably strengthened the study.


A very important point: When you are trying to analyze a positive primary endpoint, look not only at efficacy but also at the other side of the coin, which is safety. Do safety concerns counterbalance positive efficacy? You need to consider the absolute benefits and risks: the NNT for benefit versus the number needed to harm (NNH) for a safety concern. This may provide a guide to net clinical benefit. Let us take two examples. One is the SPRINT trial[14] of intensive versus standard blood pressure (BP) control in over 9000 patients with moderate hypertension. There was a 0.6%-1.6% absolute reduction (with intensive BP control) in major adverse cardiac events (MACE), CV death, and severe HF.

Counterbalancing these benefits was a 0.6%-1.6% increase in hypotension, syncope, electrolyte abnormalities, and acute renal insufficiency. How do you know whether patients overall benefited or were harmed by this therapy? Perhaps the arbiter is absolute mortality. In this trial, there was a highly statistically significant 1.2% reduction in absolute mortality. The NNT was 90-95 to save one life. Therefore, the net effect of intensive BP control is beneficial if intolerable side effects do not occur.


In contrast, look at the DAPT trial[15] of 12 versus 30 months of clopidogrel in 9661 aspirin-treated patients after a drug-eluting stent (DES). This is extended dual antiplatelet therapy (DAPT). Here, there was clear ischemic benefit, with a 1%-2% absolute reduction in major adverse cardiovascular or cerebrovascular events (MACCE), stent thrombosis, and MI. We have to weigh that against the risk of increased major bleeding. Did patients overall benefit or come to harm? There was no significant difference in CV death, but both non-CV death and all-cause mortality were actually increased, with a borderline P value of .05. Was this chance, or is this real net clinical harm to these patients? Of course, controversy has ensued: Is the net effect of prolonging DAPT beneficial, harmful, or neutral?


Finally, are there flaws in the trial design or conduct? If the results are strongly positive, you have to consider whether there were biases in trial design or conduct before you can accept that there is genuine benefit. For example, the SYMPLICITY-HTN-2 trial[16] of renal denervation in resistant hypertension was markedly positive, suggesting that this therapy reduces BP. This was an open-label trial, which can introduce placebo effects, Hawthorne effects, ascertainment bias, and other issues.


This was followed by the SYMPLICITY-HTN-3[17] sham-controlled trial of renal denervation in 535 patients with resistant hypertension, which showed no significant between-group reduction in BP. The control arm had a significant reduction in addition to the treatment arm, again showing how powerful the placebo and Hawthorne effects can be. Blinding trials is very important to eliminate bias.


You can look at another study, the ATLAS ACS 2-TIMI 51 trial[18] of rivaroxaban versus placebo in more than 15,000 patients with ACS. This was a positive trial, with a 16% relative reduction in the hazard of CV death, MI, or stroke at 2-year follow-up. However, almost 28% of the patients discontinued treatment early, 8% withdrew consent, and 7.2% had missing vital status data. Because of this uncertainty, the FDA did not grant this agent an indication for ACS.


You have to ask: Do the findings apply to your patients? You must carefully examine not only the inclusion criteria but also the characteristics of the patients actually enrolled, their background therapies, and the regions in which they were treated, to determine whether the results are relevant. For example, with intensive versus standard BP lowering in the SPRINT trial,[14] there was a reduction in death or HF, but the trial excluded patients with diabetes. In contrast, the ACCORD BP trial[19] was very similar but enrolled only patients with type 2 diabetes. There was no reduction in mortality or HF, but there was a reduction in stroke.

The question comes up whether these varying results are due to differences in patient characteristics, trial methods, background therapies, chance, or other factors.


You also have to be careful about single-center trials; they must be viewed very cautiously. Center-specific effects, such as particular systems of care and background therapies, may not be relevant to other centers, which limits generalizability. Single-center trials also often lack quality-control measures. Results from single-center trials, even those with a reasonable sample size, should rarely serve as the basis for changing guidelines unless validated in subsequent multicenter trials.


For example, we all remember the TAPAS trial,[20] a single-center trial of more than 1000 patients, looking at thrombus aspiration during primary PCI for ST-elevation MI (STEMI). It showed a very substantial and impressive reduction in mortality at 1 year with thrombus aspiration. The primary endpoint of this trial was ST-segment resolution and myocardial blush, which were modestly improved with thrombus aspiration, but there was no reduction in infarct size that would suggest a mechanism for a mortality benefit. Nonetheless, this therapy was widely adopted and ultimately led to two very large outcomes trials, TASTE[21] and TOTAL,[22] in more than 17,000 patients. These two trials showed no clinical benefit for routine use of thrombus aspiration. Beware: the original mortality finding was an implausibly large treatment effect with no supporting mechanism. Be very careful about single-center trials because, as in this case, they are underpowered to reliably detect a mortality reduction, and an apparent one may be spurious.


Finally, by the time the trial is reported, has the experimental treatment been surpassed by new therapies? If so, it may not be relevant to the way you practice. Good examples are the SYNTAX[23] and FREEDOM[24] trials, which demonstrated superiority of bypass surgery over first-generation DES. Since then, there have been multiple studies of newer contemporary stents; for example, fluoropolymer-based everolimus-eluting stents (EES) compared with [the first-generation] paclitaxel-eluting stents (PES) demonstrated better efficacy, with reductions in ischemic target lesion revascularization (TLR),[25] and better safety, with a 70% reduction in stent thrombosis.[26] That is how we are practicing today; we no longer use the PES that led to the early differences in SYNTAX and FREEDOM. In fact, we just saw the EXCEL trial[27] in patients with unprotected left main disease of low or moderate anatomic complexity, which showed noninferiority of everolimus-eluting stents to coronary artery bypass graft (CABG) surgery.


In conclusion, if the primary endpoint is positive, you have to ask yourself, is that good enough? A significance level of 5% for the primary endpoint is the minimum requirement for a trial to be declared positive. If achieved, it should prompt deeper inspection into all study processes and outcomes, including safety. It must also be determined whether the findings translate into net clinical benefit and cost-effectiveness in real-world patients. Ultimately, the reason it is important for you to understand this, as well as Stuart Pocock's presentation on what to do when the primary endpoint is P > .05, is that physicians at the point of care bear the ultimate responsibility for accurately interpreting clinical trial results and for integrating regulatory and guideline recommendations to make the best treatment decisions for each patient in their care. You cannot just look at the abstract; you have to dig deep into the article and understand the nuances of clinical trial design and outcome interpretation.

I hope that this lecture and the one given by Stuart Pocock have helped you in this respect. Thank you very much.
