One of the questions I'm almost always asked when I give talks on
statistics, whether to journalists or physicians, is how large a study
has to be before it should be considered worth writing about or making
treatment decisions on. All of my audiences seem to want a quick, easy
number to latch onto so they can toss the unimportant studies and spend
more time with the critical ones.
I'm always amazed that my audience isn't aware of that magic number. So
to save you, dear reader, some time, here it is: 267. Studies with 266
subjects just don't cut it, but 268 means it's definitely a good study.
What, you doubt me? Well, you should. As I always tell my audiences,
there is no magic number. Tiny studies can sometimes lead to perfectly
valid results, and massive studies can sometimes be meaningless. So how
do you tell the difference?
The Power of Power
The way to tell the difference is to look for evidence of a power
calculation, also referred to as a sample size calculation. Before a
study begins, the authors estimate the probability that the design and
size of the trial will allow them to correctly reject the null
hypothesis if it is false. In English, a null hypothesis says something like "the
intervention has no effect" or "there is no difference between groups."
In other words, a power calculation predicts the chance that a real
effect will be picked up. (You may recall from biostatistics courses
that a false negative is a type II error, while a false positive
is a type I error; power is one minus the probability of a type II
error.)
By convention, 80% is the threshold. A power of 80% at a specified
significance level (say, P<0.05) means that if the hypothesized
difference is real, 80% of studies of that size will produce
statistically significant results. At or above this level, a study is
considered adequately powered; below it, it isn't.
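For readers who want to see the arithmetic, here is a minimal sketch of a sample size calculation. It assumes the simplest case, which is not necessarily the design of any trial cited in this column: two equal-sized groups, a continuous outcome, a two-sided test and the normal approximation. The function names are made up for illustration, and the effect size is the hypothesized difference divided by the standard deviation.

```python
from math import ceil, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate subjects needed in each arm (normal approximation).

    effect_size: hypothesized difference in means divided by the
    standard deviation (an assumption the study designer must supply).
    """
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)  # two-sided significance cutoff
    z_power = _STD_NORMAL.inv_cdf(power)          # quantile for the desired power
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


def approx_power(effect_size, n, alpha=0.05):
    """Approximate power with n subjects per arm (ignores the negligible far tail)."""
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    return _STD_NORMAL.cdf(effect_size * sqrt(n / 2) - z_alpha)


if __name__ == "__main__":
    for d in (0.2, 0.5, 0.8):
        n = n_per_group(d)
        print(f"effect size {d}: {n} per arm, power about {approx_power(d, n):.2f}")
```

Under these assumptions, a moderate effect size of 0.5 needs 63 subjects per arm to reach 80% power, while a small effect of 0.2 needs 393. A real trial would use a method matched to its outcome and design, so treat the numbers as illustrative only.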
Even when power calculations are performed, however, they're not always
easy to find in a paper, even if you know where to look. This is
particularly likely for trials with negative results, those
that show no statistically significant difference between the groups
being compared. In a study published last year, a group at the Queen
Elizabeth Hospital in Adelaide, Australia, looked at 228 prospective,
randomized controlled trials in rheumatology published between 2001 and
2002 (J Rheumatol 2005;32:2083-2088). After assuming that the
studies showing a positive result were adequately powered, they found
that just 37 of the 86 with a negative or "indeterminate" result had
bothered to report a sample size calculation, although all but four of
those 37 turned out to have adequate power. When the authors did their
own sample size calculations for the remaining 49, only 10 passed
muster. That meant that half of the trials with negative or
indeterminate results (43 of 86) were inadequately powered.
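The tally behind that "half" is easy to check. This short sketch uses only the counts quoted above from the rheumatology review; the variable names are mine.

```python
# Counts as quoted in the text from J Rheumatol 2005;32:2083-2088
negative_trials = 86        # negative or "indeterminate" results
reported_calc = 37          # reported a sample size calculation
reported_adequate = reported_calc - 4   # all but four had adequate power
unreported = negative_trials - reported_calc
unreported_adequate = 10    # passed the reviewers' own calculation

adequate = reported_adequate + unreported_adequate
inadequate = negative_trials - adequate
share_inadequate = inadequate / negative_trials

print(f"{unreported} trials reported no calculation")
print(f"{inadequate} of {negative_trials} inadequately powered ({share_inadequate:.0%})")
```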
Inadequate powering was also frequent when researchers looked at acute
stroke trials (Stroke 2004;35:1216-1224) and surgery trials
(Surgery 2003;134:275-279). And in two reviews, researchers have
found the same was true in studies of head injury (BMJ
2000;320:1308-1311) and orthopedics (J Bone Joint Surg Am
1999;81:1454-1460). Perhaps more alarming, in the head injury and
orthopedics studies, inadequate powering affected all studies, not just
the negative ones.
There are many ethical questions raised when trials are not designed to
have adequate power (for a review, see JAMA 2002;288:358-362).
But the overarching one is this: why are we subjecting people to being
poked and prodded when the study is doomed to fail from the start? The
flip side of that is also an issue: Why have studies that are
overpowered? If 987 patients are enough, why enroll 1,113? One could
argue that such studies make up for loss to follow-up and allow
investigation of rarer effects, but the ethical concerns must be kept in
mind.
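To make the loss-to-follow-up argument concrete, here is a rough sketch of how an enrollment target is commonly inflated for expected dropout. The 10% dropout rate is purely an assumption for illustration, and the function names are hypothetical; the 987 and 1,113 figures are simply the ones used in the example above.

```python
from math import ceil


def enrollment_target(n_required, dropout_rate):
    """Subjects to enroll so that n_required are expected to finish the study."""
    if not 0 <= dropout_rate < 1:
        raise ValueError("dropout_rate must be at least 0 and less than 1")
    return ceil(n_required / (1 - dropout_rate))


def implied_dropout(n_required, n_enrolled):
    """Dropout rate at which n_enrolled would leave exactly n_required completers."""
    return 1 - n_required / n_enrolled


if __name__ == "__main__":
    print(enrollment_target(987, 0.10))            # assumed 10% dropout
    print(f"{implied_dropout(987, 1113):.1%}")     # dropout that would justify 1,113
```

With an assumed 10% dropout, 987 required subjects becomes an enrollment target of 1,097. Enrolling 1,113 would be justified only by an expected dropout of a bit more than 11%.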
Beyond important ethical considerations, however, are problems of
interpretation: It's very difficult to justify using the results of an
underpowered study to make treatment decisions. Note that a study can
describe many different valid power analyses—for the main effect, for
secondary outcomes, for the sample as a whole and for patient subgroups.
Of course, studies may be underpowered to find differences in particular
subgroups, but that doesn't make their overall findings irrelevant, if
there is sufficient power to reach conclusions regarding the overall
group.
The problem arises only when press releases, or authors, try to do too
much with the results. If a peer reviewer or journal editor has kept
the authors honest by having them write something along the lines
of "our study was underpowered to detect a significant difference in
this subgroup," then ignore whatever hand-waving conclusion they draw
about that subgroup.
The Rare Condition
Okay, you're saying, it's easy to design an adequately powered study for
common conditions, but what about when it's impossible to have a large
trial because a condition is very rare? Sample size calculations help
you here: It turns out that the rarer a condition, and therefore the
fewer subjects you can enroll, the larger the hypothesized difference
or effect size has to be for a study to have adequate statistical
power, that is, to show a significant difference between groups.
A good rule of thumb is that the more common the condition, the more
subjects are needed. If you're looking for a statistically significant
difference in a trial of a rare cancer that affects only a few thousand
people per year, a trial with fewer than 20 people in each arm may be
enough, provided the hypothesized effect is large. If you're looking
for a difference in people with hypertension, however, which affects
millions, you'll probably need at least a few hundred subjects.
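The trade-off between sample size and effect size can be sketched by turning the power calculation around: given the number of subjects per arm, how large an effect can the study reliably detect? The sketch below assumes a two-sided comparison of two equal groups under the normal approximation, with effect size expressed in standard deviation units. The function name is hypothetical, and the arm sizes of 20 and 300 are stand-ins for "fewer than 20" and "a few hundred."

```python
from math import sqrt
from statistics import NormalDist


def min_detectable_effect(n_per_arm, alpha=0.05, power=0.80):
    """Smallest standardized effect detectable with the given power (approximate)."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # two-sided significance cutoff
    z_power = std_normal.inv_cdf(power)          # quantile for the desired power
    return (z_alpha + z_power) * sqrt(2 / n_per_arm)


if __name__ == "__main__":
    for n in (20, 300):
        print(f"{n} per arm: effect size of about {min_detectable_effect(n):.2f}")
```

Under these assumptions, 20 subjects per arm gives 80% power only for an effect of roughly 0.9 standard deviations, which is very large, while 300 per arm can pick up an effect of about 0.23.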
Another way to think about it is that what you're really doing is
narrowing the confidence interval (see earlier column "The 5% Solution
to Reading Studies With Confidence," May 2005, page 19). Studies with
smaller sample sizes, whether of a rare condition or otherwise, will have
wider confidence intervals around their findings, all else being equal.
A larger study, which is more feasible in a common condition, gives you
a better sense of the variability among patients, so you can state the
results with more assurance: the range of plausible results within the
confidence interval is narrower.
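A quick sketch shows how the interval narrows. It assumes a 95% confidence interval around a single mean with a known standard deviation, using the normal approximation; the standard deviation of 10 and the function name are invented for the example.

```python
from math import sqrt
from statistics import NormalDist


def ci_half_width(sd, n, confidence=0.95):
    """Half-width of a confidence interval for a mean (normal approximation)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * sd / sqrt(n)


if __name__ == "__main__":
    for n in (25, 100, 400):
        print(f"n = {n}: mean plus or minus {ci_half_width(10, n):.2f}")
```

Because the width shrinks with the square root of the sample size, quadrupling the number of subjects only halves the interval.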
Keep in mind, however, that such small sample sizes for rare conditions
aren't automatically okay. A power calculation will still determine
whether you have a large enough sample size. Study designers sometimes
argue that it's still ethical and useful to perform small sample size
trials on such rare conditions, even when they haven't shown the study
to be adequately powered, because they can later gather several of those
studies into a meta-analysis with greater power. By now, however, we all
know the flaws in meta-analyses, so I say, why not just do the proper
study first so no one will doubt it? The additional subjects needed may
be only a handful.
Thanks to Howard Barkan, a biostatistician and research methodologist
in the Department of Surgery, Kaiser Permanente, Oakland, California,
for his assistance.