Sample Size Calculation? Well, It's a Living
A rough outline of a typical statistical collaboration between an investigator and a statistician:
Investigator: How many patients do I need for my study?
Statistician: Perhaps we could think about the study design ...
Investigator: Whatever. Just tell me how many patients I need.
[Gap of 3 years: Statistician writes a brilliant theory paper for Biometrika]
Investigator: Right, here is the data set. What is the P value?
Statistician: Oh, hello again. Look, I have some concerns about how the endpoint was assessed...
Investigator: Fine, I'll add a sentence to the discussion. Just tell me the P value, okay?
I dealt with P values in a previous article in this series (please see "Related Links"), so I'll focus on the other of the only two things that investigators seem to think that statisticians provide, sample size calculations (sometimes called "power" calculations). Here is the easy bit: you give me some numbers, I plug them into a formula and tell you how many patients you need. The formula has a bunch of Greek letters, but in principle it's quite straightforward. Investigators are usually trying to detect some kind of difference, such as between pain scores in a trial of a drug versus a placebo, survival rates in a comparison of two chemotherapy regimens, or cancer recurrence in patients with high versus low expression levels of a protein. The bigger the difference you are looking for, the easier it is to see, and the fewer patients you need: you'd work out pretty quickly that people jumping out of airplanes have better survival with parachutes (>99%) than without (<1%); working out whether one type of parachute is better than another is going to take more extensive research.
The relationship between sample size and the difference you consider worth looking for follows the inverse square law: if you halve the size of the difference, you quadruple the number of patients you need. This relationship also applies to another statistic sometimes needed for sample size calculations, which relates to variation. Take the case of the pain trial, where we are trying to see whether a drug lowers pain scores. If pretty much everyone has a pain score of around 5 out of 10, it is going to be relatively easy to see whether the drug is lowering pain (eg, if the pain score in the drug group is 4). On the other hand, if pain scores are all over the place (some patients have a score of 9, others a 3), it is more difficult to tell whether differences between groups might have occurred by chance.
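The inverse square relationship can be sketched in a couple of lines of Python. This is just an illustration of the proportionality, not the full formula a statistician would actually use:

```python
def relative_sample_size(difference, sd):
    """Required sample size is proportional to (sd / difference) ** 2."""
    return (sd / difference) ** 2

# Halving the difference you want to detect quadruples the patients you need ...
assert relative_sample_size(0.5, 1.0) == 4 * relative_sample_size(1.0, 1.0)
# ... and doubling the standard deviation does exactly the same.
assert relative_sample_size(1.0, 2.0) == 4 * relative_sample_size(1.0, 1.0)
```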
The Sample Size Samba
The problem with sample size calculation is that you often get the inverse square law working twice over. Say an investigator comes along and says "We expect pain score to be around 5 in these patients and I'd like to see it go down to 4 with the drug. Standard deviation [a measure of the degree to which different patients have different pain scores] is around 1." I plug these numbers into my formula and get a total sample size of 44, at which point everyone gets excited because this means that we can get the trial done by Christmas, and have our New England Journal of Medicine paper published well in time for the departmental review next year. But then a colleague points out that the drug is very safe and inexpensive and would be worth giving if it reduced pain scores by only half a point; moreover, didn't the recent paper by Bloggs and colleagues show a standard deviation of 2? Now when I run the numbers I get a required sample size of 774. This kicks off what Ken Schulz, a well-known trial methodologist, has called "the sample size samba": we can't possibly do a trial with 774 patients, but hang on, who is to say that Bloggs is right, and what if the standard deviation was 1.5? Oh, we'd need 380 patients, which is still too many. What if we change the between-group difference to 0.75? And now the sample size calculation spits out 170 patients, which is just about doable (if not in time for departmental review), so we agree on that.
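For the curious, figures like these come out of the standard normal-approximation formula for comparing two means. The sketch below assumes a two-sided alpha of 5% and 90% power (my assumptions, not stated in the story); real software applies further corrections, so exact numbers can differ slightly:

```python
from math import ceil
from statistics import NormalDist

def n_per_group(difference, sd, alpha=0.05, power=0.90):
    """Normal-approximation sample size per group for comparing two means."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 1.28 for 90% power
    return ceil(2 * ((z_alpha + z_beta) * sd / difference) ** 2)

print(2 * n_per_group(1.0, 1.0))    # 1-point difference, SD 1 -> 44 in total
print(2 * n_per_group(0.5, 1.5))    # half-point difference, SD 1.5 -> 380
print(2 * n_per_group(0.75, 1.5))   # the compromise the team settles on -> 170
```

Note how the inverse square law bites twice: moving from a difference of 1 to 0.5 quadruples the sample size, and so does moving the standard deviation from 1 to 2.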
I wouldn't entirely blame the investigators: sample size calculations are often pushed most heavily by grant and ethical review committees. This sometimes makes sense. If we are investigating a new chemotherapy drug, we don't want to give it to lots of patients if we can tell it is ineffective by treating only a few. In addition, we don't want to go to the time and trouble of setting up a trial, and then waiting years for the results to come in, if the trial isn't big enough to tell us one way or another whether the drug works. But it is hard to worry too much about sample size for a preliminary study of a quality-of-life questionnaire or a simple blood test, especially in light of the very real uncertainties about the differences that would be important or the sort of variation we might expect.
One version of the "sample size samba" goes like this: an investigator submits a trial or grant; the review committee demands a sample size calculation; the investigators explain why doing so wouldn't be valid; the review committee sticks to its guns; the statistician ends up sticking some arbitrarily chosen numbers into a sample size formula to get a sample size close (but not too close) to what was originally planned; the review committee commends the investigators; and the study goes forward.
The True Value of Sample Size Calculation
You might not have guessed it from what I have written so far, but as it happens, I am a big fan of sample size calculation. In short, formal sample size calculation can help us think through what patientbased research can and cannot tell us about medicine. Some stories:

In sum, there are lots of things we'd like to know, but if we only have a few patients, or are looking for very small effects, it is unlikely we'll ever find out for sure. This raises a far bigger question than an institutional review board's request for a precise estimate of how many blood samples a researcher will need to explore some preliminary ideas. Thinking about what we can and can't find out with research, and what we should do in the absence of clear evidence, could not be more central to what we do as medical researchers.
Some Additional Notes for Keen Readers
Two key concepts in calculating a sample size are alpha and power. Alpha is the risk that a wellconducted study would conclude that there is an effect when none in fact exists. This is the P value below which we declare results to be "statistically significant." As such, alpha is almost always set at 5%. Power is the probability that, if there is an effect of a given size, a good trial will find it; it is typically set at 80% or 90%.
To understand power a little more, let's say you had a very effective drug that cured 99% of patients with an otherwise incurable disease (1% survival rate). A trial in which 100 patients received the drug and 100 received placebo would usually end up with a single survivor in the control group and a single death in the drug group. But although extremely unlikely, it is not impossible that you could get a 50:50 survival rate in both groups and fail to find your drug effective. So with any study, you have to take the risk that you'll end up with no difference between groups, no matter how big the true effect size. You can calculate that risk by using sample size formulae. Power is 1 minus your risk for failure: if you have a 10% risk of failing to find a true effect of a given size, your power is 90%.
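Power for a given sample size can be sketched the same way, again with the normal approximation for a two-sample comparison of means. The figures plugged in below come from the pain-trial example; this is an illustration, not a substitute for proper software:

```python
from math import sqrt
from statistics import NormalDist

def power(n_per_group, difference, sd, alpha=0.05):
    """Approximate chance a two-sample z-test detects a true difference."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    # How many standard errors apart we expect the two group means to be
    standardized = difference / (sd * sqrt(2 / n_per_group))
    return NormalDist().cdf(standardized - z_alpha)

# With 22 patients per group, a 1-point difference and SD 1, power is about 91%,
# so the risk of failing to find this true effect (1 minus power) is about 9%.
print(round(power(22, 1.0, 1.0), 2))
```

Shrink the groups and power falls: the same trial with only 5 patients per group would have well under half the chance of detecting the effect.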
Medscape Business of Medicine © 2008 Medscape
Cite this: Let's Dance! The Sample Size Samba - Medscape - Dec 16, 2008.