Whenever I'm asked to give a talk on reading medical studies or
statistics, whether to physicians, journalists or others, I start by
asking the audience for a definition of "statistical significance." (If
I could reach out from this page and ask you that question, dear reader,
I would, so count yourself lucky.) I get all sorts of answers, from
dazed expressions to "there's a big enough sample size," "it's a good
study," and "it's important." And don't assume that the amount of
training an audience has correlates with better answers.
The correct answer is that statistical significance simply means that
the probability that the results occurred by chance (that is, not
because they reflect a true result) is quite small. How small? Well,
usually 5% or less, by convention. Put another way, if you were to
repeat the experiment 100 times, you would get a similar result at least
95 times. Sounds quite arbitrary, doesn't it? Not to mention a bit
unclear. Read on for an explanation of this mysterious and critical 5%.
Deviants and Deviation
Do you know the one about the two standard deviations who walked into a
bar? When they were done drinking, there was only 5% of the tap left.
That isn't a real joke. It made about as much sense as the "no soap,
radio" punch line that ends a variety of jokes about animals in bathtubs
who ask each other to pass the soap. But you'll remember it, just as
you'll remember the dumb radio joke, and it may help you remember
there's a relationship between two standard deviations and 5%. (And if
you found it funny, you may just be in that 5% of humanity that finds me
amusing.)
Those two standard deviations, it turns out, are the source of the 5%
rule. Given a large enough sample, a data set will tend to follow the
famous bell curve, which means that 95% of the data points will fall
within two standard deviations of the mean (Figure). That leaves two
"tails" of 2.5% each on either side, for a total of 5%. Those tails are
also referred to as outliers. Sometimes, it's good to be such an
outlier: think of Hank Aaron's 755 home runs, which place him well into
the right end of the bell curve of major league baseball players. If
you're an aged Boston Red Sox fan, you spent 86 years hoping for an
outlier year, and then you got 2004. (In the interest of full
disclosure, and to show that I believe in statistics, I'm not giving up
my season tickets to the 26-time World Champion New York Yankees anytime
soon.)
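The two-standard-deviation rule is easy to check for yourself. Here's a minimal simulation, in Python with numpy, that draws normally distributed data and counts the fraction falling within two standard deviations of the mean; the seed and sample size are arbitrary choices of mine.

```python
import numpy as np

# Draw a large sample from a normal (bell-curve) distribution.
# The seed and sample size are arbitrary; any large sample will do.
rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=100_000)

# Fraction of points within two standard deviations of the mean
within = np.mean(np.abs(data - data.mean()) <= 2 * data.std())
print(within)  # close to 0.95
```

Strictly speaking, plus or minus 1.96 standard deviations covers 95.0% of a normal distribution and plus or minus 2 covers about 95.4%; the column's "two standard deviations" is the usual rounding.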
Forgetting sports for a moment, you can begin to see why it would be a
bad idea for the results of a medical study to be statistical outliers.
Outliers are just plain unlikely, and if your statistical analysis says
there's something incredibly unlikely about your results, they may not
be true. They may be, to use another phrase, false positives.
How do we express this, statistically, in medical studies? That's the
P value, which is used to say how likely it is that a difference
between two groups was due to chance. When you see P ≤ 0.05,
that means it's 5% or less likely that such a difference was due to
chance.
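To make that concrete, here's a sketch of one way a P value can be computed: a permutation test comparing two hypothetical groups. The numbers are invented for illustration, not taken from any study. The test asks how often randomly shuffling the group labels produces a difference at least as large as the one actually observed.

```python
import random
import statistics

# Hypothetical measurements for two groups (invented for illustration)
control = [5.1, 4.8, 5.3, 5.0, 4.9, 5.2, 5.1, 4.7]
treatment = [5.9, 6.1, 5.7, 6.0, 5.8, 6.2, 5.6, 6.3]

observed = statistics.mean(treatment) - statistics.mean(control)

# Permutation test: how often does random shuffling of the group labels
# produce a difference at least as extreme as the observed one?
random.seed(0)
pooled = control + treatment
n_perm = 10_000
count = 0
for _ in range(n_perm):
    random.shuffle(pooled)
    diff = statistics.mean(pooled[8:]) - statistics.mean(pooled[:8])
    if abs(diff) >= abs(observed):
        count += 1

p_value = count / n_perm
print(p_value < 0.05)  # with these made-up groups, the P value is tiny
```

If chance alone rarely reproduces a difference that big, the P value is small, and the difference is declared statistically significant.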
Researchers use P values at two points in a trial. The first is
when they compare the characteristics of their control and treatment
groups. At this point, you don't want to see much of a difference
between the two groups, so you want to see P values larger than
0.05. At the second point, however, you want to see a difference between
your groups if you want to prove that a treatment had an effect, so
you'll be looking for a P value smaller than 0.05.
To illustrate, let's go back to the Multiple Outcomes of Raloxifene
Evaluation (MORE) trial (Arch Intern Med 2002;162:1140-1143) that
we've been dissecting in my recent columns. In the study, P
values were used to compare the placebo group with the raloxifene
(Evista, Lilly) groups in terms of age, number of years since menopause,
body mass index and women with at least one prevalent vertebral
fracture. Take age: the overall P value is reported as 0.30. For
body mass index, it's 0.99. So it's statistically okay to say there's no
difference between the groups of women.
On the results side, in a recent study published in The New England
Journal of Medicine (2005;352:777-785), researchers found that among
patients given three doses of recombinant activated factor VII for
intracerebral hemorrhage diagnosed by CT within three hours after onset,
"growth in the volume of intracerebral hemorrhage was reduced by 3.3 mL,
4.5 mL and 5.8 mL in the three treatment groups, as compared with that
in the placebo group (P=0.01)." Yup, 0.01 is less than 0.05.
That's statistically significant.
What To Do With "Trends"?
But what about when P values are greater than 0.05? In a study
comparing the use by gastroenterologists of headsets during colonoscopy
with the use of traditional video screens (Gastrointest Endosc
2005;61:301-306), researchers reported that "there was a trend toward
increased time to cecum with the headset (9.8 vs 8.0 minutes; P=
0.055)." So close! But no cigar: 0.055 is greater than 0.05, and so this
result is not statistically significant.
The researchers report this as a "trend," which you'll sometimes find is
a word researchers use to couch their non-statistically significant
findings. To me, that's sort of cheating, even if it's an accepted way
to express "close to statistically significant." To go back to sports
for a minute, if you're called out half a foot from home plate, you're
still out, and your team doesn't go to the World Series. Still, context
is important. If this is a treatment for a terminal illness that doesn't
have any other treatments, maybe it's worth it. But if it's the tenth
treatment for a mild disease, I'd say, pass it up.
What about P values less than 0.05? Sometimes you'll see P=0.02
or some such figure. Is that better, statistically speaking, than
P=0.05? I suppose that if you go strictly on what P values tell you,
such smaller numbers are technically better. But for the purposes of
analyzing studies, once P is at or below 0.05, it's all the same.
There's something else, a sort of warning, that this 5% implies, too. If
we've accepted that 5% of studies show an effect when there isn't really
one, then we're accepting that one in 20 studies may actually be wrong.
There are good reasons for that threshold, having to do with the
standard deviations drinking heavily a few paragraphs earlier. It
suggests, however, that maybe it's not a good idea to rush to change
your management based on a single trial. Conversely, with multiple
trials showing similar results, it's less likely that the first one was
just a fluke.
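That one-in-20 warning can be illustrated with a quick simulation of my own (not anything from the trials above): run many "studies" of a treatment with no real effect and count how often chance alone clears the two-standard-deviation bar.

```python
import random

random.seed(1)

# Each "trial" tests a treatment with NO real effect: its z-statistic
# is pure noise from a standard normal distribution. We declare a
# (false) positive whenever |z| exceeds 1.96, the two-SD cutoff.
n_trials = 100_000
false_positives = sum(
    abs(random.gauss(0, 1)) > 1.96 for _ in range(n_trials)
)
rate = false_positives / n_trials
print(rate)  # hovers around 0.05: about one "significant" trial in 20
```

Even when nothing works, roughly one trial in 20 will look statistically significant, which is exactly why replication matters.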
Having Confidence In Confidence Intervals
The issue of confidence intervals is another important one that's tied
to this 5% rule. Confidence intervals often appear next to risk ratios
and odds ratios, or hazard ratios, all of which I discussed in my last
column (March issue). You've seen notations such as "RR, 1.6; 95% CI,
1.4-1.8." In English, that's "risk ratio, 1.6; 95% confidence interval,
1.4-1.8." In understandable English, that's something to the effect of
"the risk of [insert horrible disease] in people who [insert bad habit]
was probably 1.6 times that of the general population, but we're 95%
sure it was between 1.4 and 1.8 times that of the general population."
In other words, the risk may actually be greater than 1.6 times, or
less. You can also think about confidence intervals as similar to the
margins of error of polls and surveys; it's the same basic concept.
The quick-and-dirty way to judge statistical significance from a
confidence interval is to do what is often referred to as checking
whether the interval crosses 1. As I discussed in the last column, a
risk ratio of 1 means that there's no difference in risk between two
groups; an odds ratio of 1 means the odds of an event are the same in
two groups; and a hazard ratio of 1 means the same thing as a risk ratio
of 1, but after a multivariate analysis. So if 1 falls inside the 95%
confidence interval, it's within reason to think that if you repeated
the trial, you might find no difference. That means the result is not
statistically significant at the 95% level, which is standard. This
applies whether you're trying to lower the risk, in which case the
entire interval should stay below 1, or looking for an increased risk,
in which case the entire interval should stay above 1.
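For the numerically inclined, here's a sketch of the usual large-sample recipe for a risk ratio and its 95% confidence interval. It works on the log scale, and the event counts are invented for illustration (they're not from the studies discussed in this column).

```python
import math

def risk_ratio_ci(events_a, total_a, events_b, total_b, z=1.96):
    """Risk ratio with a large-sample 95% CI, computed on the log scale."""
    rr = (events_a / total_a) / (events_b / total_b)
    # Standard error of log(RR) for two independent proportions
    se_log = math.sqrt(1/events_a - 1/total_a + 1/events_b - 1/total_b)
    lower = math.exp(math.log(rr) - z * se_log)
    upper = math.exp(math.log(rr) + z * se_log)
    return rr, lower, upper

# Hypothetical counts: 30 events out of 100 vs 20 events out of 100
rr, lower, upper = risk_ratio_ci(30, 100, 20, 100)
crosses_one = lower <= 1 <= upper
# Here RR = 1.5, but the interval (roughly 0.92 to 2.46) crosses 1,
# so the difference is not statistically significant at the 95% level.
```

Note how a risk ratio well above 1 can still fail the significance test when the sample is small; the interval, not the point estimate, carries the verdict.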
Again, using the raloxifene/MORE study, women taking raloxifene 60 mg
per day had a 68% decrease in the risk of new clinical vertebral
fractures, and the 95% confidence interval was reported as a 20% to 87%
decrease. For the sake of consistency, those reductions correspond to a
risk ratio of 0.32, with a confidence interval of 0.13 to 0.80. So, yes,
this finding was statistically significant at the 95% level; the CI
didn't cross 1.
Let's rip an example from the headlines, as it were. In a study of
celecoxib (Celebrex, Pfizer) and heart disease among patients in a
clinical trial of colorectal adenoma prevention, subjects reached "a
composite cardiovascular end point of death from cardiovascular causes,
myocardial infarction, stroke, or heart failure" 1.0% of the time in the
placebo group and 2.3% of the time in the treatment group; the latter
received 200 mg of celecoxib bid (N Engl J Med e-publication
ahead of print, 10.1056/NEJMoa050405). This gave a hazard ratio of 2.3,
with a 95% confidence interval of 0.9 to 5.5. See something about those
numbers? Yup, it crossed 1. So, at a 95% confidence interval, the
finding is not statistically significant. Again, although it isn't true
here, researchers may say this showed a "trend" toward significance.
And, just as you may see P values lower than 0.05, you may see
confidence intervals greater than 95%. Again, by convention, 95% is
usually good enough, but sometimes just to prove something closer to
beyond a shadow of a doubt, studies will report 97% or 99%.
I hope these 5% rules have burned into your brain. If they haven't yet,
perhaps you're an outlier, or, more likely, I'm an outlier when it comes
to explaining these things clearly. But better that they're clear.
Ask the Expert
Deciphering P Values
Dear Dr. Oransky:
Your "Statistically Speaking" column in Pharmacy Practice
News is wonderful! As a pharmacist still trying to learn about
statistics, I find your information to be simple and easy to understand.
I hope that you continue to put out more articles. A quick question:
I have seen data where a relative risk will have an associated P
value, as shown in the example below, which looks at the higher risk of
producing hyperglycemia when patients are on steroids versus when they
don't have steroids. How and where did they come up with the P
value?
John Vu, PharmD
Pharmacy Practice Resident
Methodist Dallas
Medical Center
JohnVu@mhd.com
Dr. Oransky responds:
Thanks so much for reading, and for the kind words! I've noticed that
P values seem to be creeping into evaluations of risk, hazard and
odds ratios. It's as if people don't think confidence intervals are
enough, although I have yet to see a case in which the confidence
intervals do not cross 1 and yet the P value is greater than 0.05.
I consulted Ted Holford, a Professor of Biostatistics at Yale University
and author of Multivariate Methods in Epidemiology, and here's
what he said about P values and risk ratios: What the P
value tells you in these cases is the probability of observing the data
if the null hypothesis (e.g., that the relative risk is equal to 1) is
true, in much the same way as testing the null hypothesis that two means
are equal for a normally distributed population.
How they calculate it is probably the same way they calculate most
statistics; that is, they give it to a statistician or they throw all
the data into a statistics program.
Study Group    Fingerstick Blood Sugars    Fingersticks Resulting in Hyperglycemia    Unadjusted Relative Risk (95% CI)    P Value
Steroids       235                         149                                        1.57 (1.38-1.77)                     <0.0001
No steroids    785                         317                                        NA                                   NA