One of the questions I'm almost always asked when I give talks on statistics, whether to journalists or physicians, is how large a study has to be before it should be considered worth writing about or making treatment decisions on. All of my audiences seem to want a quick, easy number to latch onto so they can toss the unimportant studies and spend more time with the critical ones.

I'm always amazed that my audience isn't aware of that magic number. So to save you, dear reader, some time, here it is: 267. Studies with 266 subjects just don't cut it, but 268 means it's definitely a good study.

What, you doubt me? Well, you should. As I always tell my audiences, there is no magic number. Tiny studies can sometimes lead to perfectly valid results, and massive studies can sometimes be meaningless. So how do you tell the difference?

The Power of Power

The way to tell the difference is to look for evidence of a power calculation, also referred to as a sample size calculation. Before a study begins, the authors estimate the probability that a trial of the planned design and size will correctly reject a null hypothesis that is in fact false. In English, a null hypothesis says something like "the intervention has no effect" or "there is no difference between groups." In other words, a power calculation predicts the chance that a real effect will be picked up. (You may recall from biostatistics courses that missing a real effect, a false negative, is a type II error, while a false positive is a type I error.)

By convention, 80% is the threshold. A power of 80% at a specified significance level (say, P<0.05) for a study of a given size means that, if the hypothesized difference really exists, 80% of studies of that size will come up with a statistically significant result. At or above this level, a study is considered adequately powered; below it, it isn't.
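For readers who want to see what such a calculation looks like in practice, here is a minimal sketch in Python using the statsmodels library. Every number in it is an assumption chosen purely for illustration (a hypothesized difference of half a standard deviation between two groups), not a figure from any study discussed in this column.

    # Minimal power-calculation sketch; the 0.5-standard-deviation effect
    # size is an illustrative assumption, not a value from any real trial.
    from statsmodels.stats.power import TTestIndPower

    analysis = TTestIndPower()
    effect_size = 0.5   # hypothesized difference of 0.5 standard deviations

    # Sample size per group needed for 80% power at a two-sided P < 0.05
    n_per_group = analysis.solve_power(effect_size=effect_size,
                                       alpha=0.05, power=0.80)
    print(f"Subjects needed per group: {n_per_group:.0f}")   # roughly 64

    # Power of a much smaller study (20 subjects per group) for the same effect
    power_small = analysis.power(effect_size=effect_size, nobs1=20, alpha=0.05)
    print(f"Power with 20 per group: {power_small:.2f}")      # roughly 0.34

In other words, if the true difference really were half a standard deviation, a trial with only 20 patients per arm would catch it barely a third of the time, which is why reviewers insist on seeing the calculation up front.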

Even when power calculations are performed, however, they're not always easy to find in a published study, even if you know where to look. This is particularly true for trials with negative results, ones that show no statistically significant difference between the groups being compared. In a study published last year, a group at the Queen Elizabeth Hospital in Adelaide, Australia, looked at 228 prospective, randomized controlled trials in rheumatology published in 2001 and 2002 (J Rheumatol 2005;32:2083-2088). After assuming that the studies showing a positive result were adequately powered, they found that just 37 of the 86 with a negative or "indeterminate" result had bothered to report a sample size calculation, although all but four of those 37 turned out to have adequate power. When the authors did their own sample size calculations for the remaining 49, only 10 passed muster for adequate size. That meant that half of the trials with negative or indeterminate results were inadequately powered.

Inadequate powering was also frequent when researchers looked at acute stroke trials (Stroke 2004;35:1216-1224) and surgery trials (Surgery 2003;134:275-279). And in two reviews, researchers found the same was true in studies of head injury (BMJ 2000;320:1308-1311) and orthopedics (J Bone Joint Surg Am 1999;81:1454-1460). Perhaps more alarming, in the head injury and orthopedics reviews, inadequate powering affected all of the studies examined, not just the negative ones.

There are many ethical questions raised when trials are not designed to have adequate power (for a review, see JAMA 2002;288:358-362). But the overarching one is this: why are we subjecting people to being poked and prodded when the study is doomed to fail from the start? The flip side is also an issue: why run studies that are overpowered? If 987 patients are enough, why enroll 1,113? One could argue that such studies make up for losses to follow-up and allow investigation of rarer effects, but the ethical concerns must be kept in mind.

Beyond the important ethical considerations, however, are problems of interpretation: it's very difficult to justify using the results of an underpowered study to make treatment decisions. Note that a study can report many different valid power analyses: for the main effect, for secondary outcomes, for the sample as a whole and for patient subgroups. Of course, studies may be underpowered to find differences in particular subgroups, but that doesn't make their overall findings irrelevant if there is sufficient power to reach conclusions about the overall group.

The problem arises only when press releases, or the authors themselves, try to do too much with the results. If a peer reviewer or journal editor has kept an author honest by having her write something along the lines of "our study was underpowered to detect a significant difference in this subgroup," take that admission at face value and ignore whatever hand-waving conclusions are drawn about that subgroup.

The Rare Condition

Okay, you're saying, it's easy to design an adequately powered study for common conditions, but what about when it's impossible to mount a large trial because a condition is very rare? Sample size calculations help you here: because a rare condition limits how many subjects you can enroll, the hypothesized difference, or effect size, has to be larger for a study of that size to have adequate statistical power to show a significant difference between groups.

A good rule of thumb is that the more common the condition, and the smaller the difference you're chasing, the more subjects you'll need. If you're looking for a statistically significant difference in a trial of a rare cancer that affects only a few thousand people per year, a trial with fewer than 20 people in each arm may be enough. If you're looking for a difference in people with hypertension, however, which affects millions, you'll probably need at least a few hundred subjects.
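To make that rule of thumb concrete, here is another rough sketch, using the same Python library and the same caveat: every effect size in it is hypothetical. The sample needed per group for 80% power at a two-sided P<0.05 drops sharply as the hypothesized effect grows.

    # How required sample size (per group, 80% power, two-sided P < 0.05)
    # shrinks as the hypothesized effect grows; effect sizes are illustrative.
    from statsmodels.stats.power import TTestIndPower

    analysis = TTestIndPower()
    for effect_size in (0.2, 0.5, 0.8, 1.5):
        n = analysis.solve_power(effect_size=effect_size, alpha=0.05, power=0.80)
        print(f"effect size {effect_size}: about {n:.0f} subjects per group")

    # effect size 0.2 (a subtle difference of the sort a common-condition
    #   trial might chase): nearly 400 subjects per group
    # effect size 1.5 (the kind of dramatic benefit a rare-disease trial may
    #   have to count on): fewer than 10 subjects per group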

Another way to think about it is that what you're really doing is narrowing the confidence interval (see the earlier column "The 5% Solution to Reading Studies With Confidence," May 2005, page 19). Studies with smaller sample sizes, whether of a rare condition or otherwise, will have wider confidence intervals around their findings, all else being equal. If you have a good sense of the variability in a patient sample, which is more likely in a common condition, you can state the results with more assurance, because a larger study will have a narrower range of possible results within its confidence interval.
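If you prefer to see the arithmetic behind that shrinking interval, here is one more sketch; the outcome's standard deviation (10 units) and the group sizes are, again, made-up numbers for illustration.

    # How the approximate 95% confidence interval around a difference in
    # group means narrows as the sample grows; sigma and the sample sizes
    # are invented for illustration.
    import math

    sigma = 10.0   # assumed standard deviation of the outcome in each group
    for n_per_group in (20, 80, 320):
        se_diff = sigma * math.sqrt(2.0 / n_per_group)   # SE of the difference
        half_width = 1.96 * se_diff                      # 95% CI half-width
        print(f"n = {n_per_group:>3} per group: difference pinned down to "
              f"within +/- {half_width:.1f} units")

    # n =  20 per group: +/- 6.2 units
    # n =  80 per group: +/- 3.1 units
    # n = 320 per group: +/- 1.5 units

Notice that quadrupling the sample only halves the width of the interval, which is why precision gets expensive fast.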

Keep in mind, however, that such small sample sizes for rare conditions aren't automatically okay. A power calculation will still determine whether you have a large enough sample. Study designers sometimes argue that it's still ethical and useful to run small trials in such rare conditions, even when they haven't shown the study to be adequately powered, because several of those trials can later be gathered into a meta-analysis with greater power. By now, however, we all know the flaws in meta-analyses, so I say, why not just do the proper study first so no one will doubt it? The additional subjects needed may be only a handful.

Thanks to Howard Barkan, a biostatistician and research methodologist in the Department of Surgery, Kaiser Permanente, Oakland, California, for his assistance.