Andrew J. Vickers, PhD


December 16, 2008

Sample Size Calculation? Well, It's a Living

A rough outline of a typical statistical collaboration between an investigator and a statistician:

Investigator: How many patients do I need for my study?
Statistician: Perhaps we could think about the study design ...
Investigator: Whatever. Just tell me how many patients I need.
[Gap of 3 years: Statistician writes a brilliant theory paper for Biometrika]
Investigator: Right, here is the data set. What is the P value?
Statistician: Oh, hello again. Look, I have some concerns about how the endpoint was assessed...
Investigator: Fine, I'll add a sentence to the discussion. Just tell me the P value, okay?

I dealt with P values in a previous article in this series (please see "Related Links"), so I'll focus on the other of the only two things that investigators seem to think statisticians provide: sample size calculations (sometimes called "power" calculations). Here is the easy bit: you give me some numbers, I plug them into a formula, and I tell you how many patients you need. The formula has a bunch of Greek letters, but in principle it's quite straightforward. Investigators are usually trying to detect some kind of difference, such as between pain scores in a trial of a drug versus a placebo, survival rates in a comparison of two chemotherapy regimens, or cancer recurrence in patients with high versus low expression levels of a protein. The bigger the difference you are looking for, the easier it is to see, and the fewer patients you need: you'd work out pretty quickly that people jumping out of airplanes have better survival with parachutes (>99%) than without (<1%); working out whether one type of parachute is better than another is going to take more extensive research.

The relationship between sample size and the difference you consider worth looking for follows the inverse square law: if you halve the size of the difference, you quadruple the number of patients you need. This relationship also applies to another statistic sometimes needed for sample size calculations, which relates to variation. Take the case of the pain trial, where we are trying to see whether a drug lowers pain scores. If pretty much everyone has a pain score of around 5 out of 10, it is going to be relatively easy to see whether the drug is lowering pain (eg, if the pain score in the drug group is 4). On the other hand, if pain scores are all over the place -- some patients have a score of 9, others a 3 -- it is more difficult to tell whether differences between groups might have occurred by chance.
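For readers who like to see the machinery, the inverse square law falls straight out of the standard formula for comparing two means. The function below is my own sketch, not something from the article, and it assumes a two-sided alpha of 5% and 90% power (these terms are explained in the notes at the end):

```python
from math import ceil
from statistics import NormalDist

def n_per_group(delta, sd, alpha=0.05, power=0.90):
    """Per-group sample size for a two-arm trial comparing means
    (normal approximation, two-sided test). A rough sketch, not a
    substitute for proper trial-design software."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 5%
    z_beta = NormalDist().inv_cdf(power)           # about 1.28 for 90% power
    return ceil(2 * (z_alpha + z_beta) ** 2 * (sd / delta) ** 2)

# Halving the difference you want to detect quadruples the sample size:
print(n_per_group(delta=1.0, sd=2.0))   # 85 per group
print(n_per_group(delta=0.5, sd=2.0))   # 337 per group, about 4 times as many
# Doubling the standard deviation has the same quadrupling effect:
print(n_per_group(delta=1.0, sd=4.0))   # 337 per group
```

Note that the sample size depends on the ratio of the standard deviation to the difference, which is why halving the difference and doubling the variation each quadruple the required number of patients.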

The Sample Size Samba

The problem with sample size calculation is that you often get the inverse square law working twice over. Say an investigator comes along and says, "We expect pain score to be around 5 in these patients and I'd like to see it go down to 4 with the drug. Standard deviation [a measure of the degree to which different patients have different pain scores] is around 1." I plug these numbers into my formula and get a total sample size of 44, at which point everyone gets excited, because this means that we can get the trial done by Christmas and have our New England Journal of Medicine paper published well in time for the departmental review next year. But then a colleague points out that the drug is very safe and inexpensive and would be worth giving if it reduced pain scores by only half a point; moreover, didn't the recent paper by Bloggs and colleagues show a standard deviation of 2? Now when I run the numbers I get a required sample size of 774. This kicks off what Ken Schulz, a well-known trial methodologist, has called "the sample size samba": we can't possibly do a trial with 774 patients, but hang on -- who is to say that Bloggs is right, and what if the standard deviation was 1.5? Oh, we'd need 380 patients, which is still too many. What if we change the between-group difference to 0.75? Now the sample size calculation spits out 170 patients, which is just about doable (if not in time for the departmental review), so we agree on that.
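If you'd like to check the samba's arithmetic yourself, the 44, 380, and 170 in this story fall out of the standard formula for comparing two means, provided you assume a two-sided alpha of 5% and 90% power (my assumptions; the anecdote doesn't state them):

```python
from math import ceil
from statistics import NormalDist

def total_n(delta, sd, alpha=0.05, power=0.90):
    """Total sample size (two equal arms) for comparing mean pain scores,
    using the usual normal-approximation formula. A sketch only."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return 2 * ceil(2 * z ** 2 * (sd / delta) ** 2)

print(total_n(delta=1.0, sd=1.0))    # 44  -- the optimistic first pass
print(total_n(delta=0.5, sd=1.5))    # 380 -- still too many patients
print(total_n(delta=0.75, sd=1.5))   # 170 -- the number everyone agrees on
```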

I wouldn't entirely blame the investigators: sample size calculations are often pushed most heavily by grant and ethical review committees. This sometimes makes sense. If we are investigating a new chemotherapy drug, we don't want to give it to lots of patients if we can tell it is ineffective by treating only a few. In addition, we don't want to go to the time and trouble of setting up a trial, and then waiting years for the results to come in, if the trial isn't big enough to tell us one way or another whether the drug works. But it is hard to worry too much about sample size for a preliminary study of a quality-of-life questionnaire or a simple blood test, especially in light of the very real uncertainties about the differences that would be important or the sort of variation we might expect.

One version of the "sample size samba" goes like this: an investigator submits a trial or grant; the review committee demands a sample size calculation; the investigators explain why doing so wouldn't be valid; the review committee sticks to its guns; the statistician ends up sticking some arbitrarily chosen numbers into a sample size formula to get a sample size close (but not too close) to what was originally planned; the review committee commends the investigators; and the study goes forward.

The True Value of Sample Size Calculation

You might not have guessed it from what I have written so far, but as it happens, I am a big fan of sample size calculation. In short, formal sample size calculation can help us think through what patient-based research can and cannot tell us about medicine. Some stories:

  1. A group of researchers designed a trial to see whether a novel postural re-education program could help back pain. They estimated that back pain scores would fall from 6 to 5 in a "usual care" control group and thought that it would be worth the time and expense of treatment if pain averaged 3.5 in the treatment group at the end of the trial. They ran their sample size calculations and determined that they needed 38 patients per group. But a grant review committee rejected their application. The reviewers wanted to know whether the effects of postural re-education were specific to the teaching, were related to the use of touch, or were simply a matter of spending time with a caring professional. Accordingly, they recommended a 4-arm trial: patients would receive usual care, informal counseling, a massage, or postural re-education. But suppose that counseling, touch, and teaching each contributed 0.5 point to the total 1.5 improvement in pain seen with postural re-education. This is a difference one third the size, meaning that 9 times as many patients are needed per group. Because the reviewers also doubled the number of groups, the sample size increased 18-fold to more than 1300 patients, which is clearly not feasible for a first trial on a new technique.

  2. A cancer surgeon specialized in a cancer for which adjuvant chemotherapy was normally given. What predicted recurrence in this cancer was well known: the size of the tumor, pathologic grade, and some blood markers. However, most of the studies had been conducted with patients on chemotherapy trials, and the surgeon wanted to know whether the same factors affected outcome in the small number of patients for whom chemotherapy was not indicated. Looking at hospital records, he found 128 such patients, of whom 12 had a recurrence. To think through this sample size, imagine there was some tumor characteristic that was relatively common (eg, found in 20% of patients) and had a large effect on outcome (eg, increasing the risk of recurrence by as much as 2-fold). To have a good chance of showing that this risk factor was associated with outcome, you'd typically need about 1000 patients, 8 times as many as the surgeon had available. The surgeon's study was big enough only to detect a predictor that led to a 6-fold increase in risk, which is very uncommon.

  3. Another cancer surgeon specialized in a procedure that was usually pretty successful, with only a 4% or 5% recurrence rate. Nonetheless, the surgeon thought that variation in the way that lymph nodes were removed might affect outcome and wanted to examine case notes from about 1200 patients to see whether this was indeed the case. If the best method of node removal led to a 10% relative decrease in recurrence rates, that would obviously be important to know. But detecting a difference of this magnitude would require at least 60,000 patients, some 50 times more than the surgeon had available.
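A quick check of the arithmetic in stories 1 and 3. The settings here are my assumptions, not details from the stories: for the recurrence comparison I've assumed a 4.5% baseline rate, 80% power, and a simple normal approximation for comparing two proportions.

```python
from math import ceil
from statistics import NormalDist

# Story 1: shrinking the detectable difference from 1.5 to 0.5 points
# (one third the size) multiplies the per-group sample size by 9; the
# 4-arm design then doubles the number of groups.
per_group = 38 * (1.5 / 0.5) ** 2     # 342 patients per group
print(round(per_group * 4))           # 1368 in total: "more than 1300"

# Story 3: comparing recurrence rates between two node-removal methods.
def n_per_group(p1, p2, alpha=0.05, power=0.80):
    """Per-group sample size for comparing two proportions
    (simple normal approximation; a sketch only)."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil(z ** 2 * variance / (p1 - p2) ** 2)

p1 = 0.045            # assumed recurrence rate with the worse method
p2 = 0.9 * p1         # a 10% relative decrease
print(2 * n_per_group(p1, p2))   # roughly 63,000 patients in total
```

The tiny absolute difference between 4.5% and about 4.05% is what drives the enormous sample size in the second calculation.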

In sum, there are lots of things we'd like to know, but if we only have a few patients, or are looking for very small effects, it is unlikely we'll ever find out for sure. This raises a far bigger question than an institutional review board's request for a precise estimate of how many blood samples a researcher will need to explore some preliminary ideas. Thinking about what we can and can't find out with research -- and what we should do in the absence of clear evidence -- could not be more central to what we do as medical researchers.

Some Additional Notes for Keen Readers

Two key concepts in calculating a sample size are alpha and power. Alpha is the risk that a well-conducted study would conclude that there is an effect when none in fact exists. This is the P value below which we declare results to be "statistically significant." As such, alpha is almost always set at 5%. Power is the probability that, if there is an effect of a given size, a good trial will find it; it is typically set at 80% or 90%.

To understand power a little more, let's say you had a very effective drug that cured 99% of patients with an otherwise incurable disease (1% survival rate). A trial in which 100 patients received the drug and 100 received placebo would usually end up with a single survivor in the control group and a single death in the drug group. But although extremely unlikely, it is not impossible that you could end up with a 50:50 survival rate in both groups and fail to find your drug effective. So with any study, you run some risk of finding no difference between groups, no matter how big the true effect. You can calculate that risk by using sample size formulae. Power is 1 minus your risk of failure: if you have a 10% risk of failing to find a true effect of a given size, your power is 90%.
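One way to see what power means in practice is to simulate it. The sketch below is my own illustration, not from the article: it reuses the pain-trial numbers from earlier (22 patients per group, a true 1-point difference, standard deviation of 1), runs 2000 imaginary trials with a simple known-standard-deviation z-test, and counts how often the real effect reaches statistical significance.

```python
import random
from statistics import NormalDist, mean

def simulated_power(n_per_group, delta, sd, alpha=0.05, n_trials=2000):
    """Estimate power by simulating many trials and counting how often
    a simple z-test on the group means reaches significance."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    hits = 0
    for _ in range(n_trials):
        control = [random.gauss(5.0, sd) for _ in range(n_per_group)]
        treated = [random.gauss(5.0 - delta, sd) for _ in range(n_per_group)]
        se = sd * (2 / n_per_group) ** 0.5   # known sd, for simplicity
        z = (mean(control) - mean(treated)) / se
        if abs(z) > z_crit:
            hits += 1
    return hits / n_trials

random.seed(1)
print(simulated_power(n_per_group=22, delta=1.0, sd=1.0))  # roughly 0.90
```

In other words, even with a drug that genuinely lowers pain by a full point, roughly 1 trial in 10 of this size will come up empty-handed.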

