Whenever I'm asked to give a talk on reading medical studies or statistics--whether to physicians, journalists or others--I start by asking the audience for a definition of "statistical significance." (If I could reach out from this page and ask you that question, dear reader, I would, so count yourself lucky.) I get all sorts of answers, from dazed expressions to "there's a big enough sample size," "it's a good study," and "it's important"--and don't assume that the amount of training an audience has correlates with better answers.

The correct answer is that statistical significance simply means the probability of seeing results like these by chance alone--that is, if there were no true effect--is quite small. How small? Well, usually 5% or less, by convention. Put another way, if there were really no effect and you were to repeat the experiment 100 times, you would expect a result like this to turn up only about 5 of those times. Sounds quite arbitrary, doesn't it? Not to mention a bit unclear. Read on for an explanation of this mysterious and critical 5%.

Deviants and Deviation

Do you know the one about the two standard deviations who walked into a bar? When they were done drinking, there was only 5% of the tap left.

That isn't a real joke. It made about as much sense as the "no soap, radio" punch line that ends a variety of jokes about animals in bathtubs who ask each other to pass the soap. But you'll remember it, just as you'll remember the dumb radio joke, and it may help you remember there's a relationship between two standard deviations and 5%. (And if you found it funny, you may just be in that 5% of humanity that finds me amusing.)

Those two standard deviations, it turns out, are the source of the 5% rule. Given a large enough sample, many kinds of data will approximate the famous bell curve, which means that about 95% of the data points will fall within two standard deviations of the mean (Figure). That leaves two "tails" of 2.5% each on either side, for a total of 5%. Data points out in those tails are also referred to as outliers. Sometimes, it's good to be such an outlier--think Hank Aaron's 755 home runs, which place him well into the right end of the bell curve of major league baseball players. If you're an aged Boston Red Sox fan, you spent 86 years hoping for an outlier year, and then you got 2004. (In the interest of full disclosure, and to show that I believe in statistics, I'm not giving up my season tickets to the 26-time World Champion New York Yankees anytime soon.)
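If you'd rather see that number fall out of a calculation than take the bell curve's word for it, here's a quick sketch in Python (my choice of tool, not anything used in the studies discussed here):

# How much of a normal (bell-curve) distribution lies within 2 standard
# deviations of the mean?
from scipy.stats import norm

within_2_sd = norm.cdf(2) - norm.cdf(-2)
print(f"Within +/- 2 SD: {within_2_sd:.3f}")   # about 0.954, i.e., roughly 95%

# The cutoff that leaves exactly 2.5% in each tail (5% total) is about 1.96 SD:
print(f"Cutoff for an exact 95%: {norm.ppf(0.975):.2f} SD")   # about 1.96

So "two standard deviations" is the convenient round number; the exact figure is closer to 1.96.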

Forgetting sports for a moment, you can begin to see how it would be a bad idea for a medical study's results to be statistical outliers. Outliers are just plain unlikely, and if your statistical analysis says there's something incredibly unlikely about your results, they may not be true. They may be, to coin another phrase, false-positives.

How do we express this, statistically, in medical studies? That's the P value, which is used to say how likely it is that a difference as large as the one observed between two groups would have turned up by chance alone. When you see P≤0.05, that means there's a 5% or smaller chance that such a difference was due simply to chance.
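To see what's behind a number like that, here's a minimal sketch, again in Python; the scenario and the numbers are entirely made up for illustration. It simulates many comparisons of two groups drawn from identical populations--so any "difference" is pure chance--and counts how often the comparison still comes out "significant" at the 5% cutoff.

# Simulate experiments in which there is NO real difference between groups,
# and count how often a comparison still yields P <= 0.05 by chance alone.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments = 10_000
false_positives = 0

for _ in range(n_experiments):
    # Both groups come from the same population; any difference is chance.
    group_a = rng.normal(loc=100, scale=15, size=50)
    group_b = rng.normal(loc=100, scale=15, size=50)
    _, p = stats.ttest_ind(group_a, group_b)
    if p <= 0.05:
        false_positives += 1

print(f"Fraction 'significant' by chance alone: {false_positives / n_experiments:.3f}")
# Expect a number close to 0.05 -- about 1 comparison in 20 cries wolf.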

Researchers use P values at two points in a trial. The first is when they compare the characteristics of their control and treatment groups. At this point, you don't want to see much of a difference between the two groups, so you want to see P values larger than 0.05. At the second point, however, you want to see a difference between your groups if you want to prove that a treatment had an effect, so you'll be looking for a P value smaller than 0.05.

To illustrate, let's go back to the Multiple Outcomes of Raloxifene Evaluation (MORE) trial (Arch Intern Med 2002;162:1140-1143) that we've been dissecting in my recent columns. In the study, P values were used to compare the placebo group with the raloxifene (Evista, Lilly) groups in terms of age, number of years since menopause, body mass index and women with at least one prevalent vertebral fracture. Take age: the overall P value is reported as 0.30. For body mass index, it's 0.99. So it's statistically okay to say there's no difference between the groups of women.

On the results side, in a recent study published in The New England Journal of Medicine (2005;352:777-785), researchers found that among patients given three doses of recombinant activated factor VII for intracerebral hemorrhage diagnosed by CT within three hours after onset, "growth in the volume of intracerebral hemorrhage was reduced by 3.3 mL, 4.5 mL and 5.8 mL in the three treatment groups, as compared with that in the placebo group (P=0.01)." Yup, 0.01 is less than 0.05. That's statistically significant.

What To Do With "Trends"?

But what about when P values are greater than 0.05? In a study comparing the use by gastroenterologists of headsets during colonoscopy with the use of traditional video screens (Gastrointest Endosc 2005;61:301-306), researchers reported that "there was a trend toward increased time to cecum with the headset (9.8 vs 8.0 minutes; P= 0.055)." So close! But no cigar: 0.055 is greater than 0.05, and so this result is not statistically significant.

The researchers report this as a "trend," which you'll sometimes find is a word researchers use to couch their non-statistically significant findings. To me, that's sort of cheating, even if it's an accepted way to express "close to statistically significant." To go back to sports for a minute, if you're called out half a foot from home plate, you're still out, and your team doesn't go to the World Series. Still, context is important. If this is a treatment for a terminal illness that doesn't have any other treatments, maybe it's worth it. But if it's the tenth treatment for a mild disease, I'd say, pass it up.

What about P values less than 0.05? Sometimes you'll see P=0.02 or some such figure. Is that better, statistically speaking, than P=0.05? I suppose that if you go strictly on what P values tell you, such smaller numbers are technically better. But for the purposes of analyzing studies, once P is at or below 0.05, it's all the same.

There's something else--a sort of warning--that this 5% implies, too. If we've accepted that 5% of studies show an effect when there isn't really one, then we're accepting that one in 20 studies may actually be wrong. There are good reasons for that threshold, having to do with the standard deviations drinking heavily a few paragraphs earlier. It suggests, however, that maybe it's not a good idea to rush to change your management based on a single trial. Conversely, with multiple trials showing similar results, it's less likely that the first one was just a fluke.
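To put rough numbers on why replication reassures us, here's a back-of-the-envelope sketch (it treats the trials as independent, which real trials never fully are):

# If a single trial has a 5% chance of showing a spurious "effect" when there
# is truly none, what are the odds that several independent trials all do?
alpha = 0.05
for n_trials in (1, 2, 3):
    print(f"{n_trials} trial(s) all positive by chance: {alpha ** n_trials:.4%}")
# Prints roughly 5%, 0.25% and 0.0125%, respectively.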

Having Confidence In Confidence Intervals

The issue of confidence intervals is another important one that's tied to this 5% rule. Confidence intervals often appear next to risk ratios, odds ratios or hazard ratios, all of which I discussed in my last column (March issue). You've seen notations such as "RR, 1.6; 95% CI, 1.4-1.8." In English, that's "risk ratio, 1.6; 95% confidence interval, 1.4-1.8." In understandable English, that's something to the effect of "the risk of [insert horrible disease] in people who [insert bad habit] was probably 1.6 times that of the general population, but we're 95% sure it was between 1.4 and 1.8 times that of the general population." In other words, the risk may actually be greater than 1.6 times, or less. You can also think of confidence intervals as similar to the margins of error reported for polls and surveys--it's the same basic concept.

The quick and dirty guide to tell statistical significance from confidence intervals is to do what is often referred to as checking whether the interval crosses 1. As I discussed in the last column, a risk ratio of 1 means that there's no difference in risk between two groups; an odds ratio of 1 means the odds of an event are the same in two groups; and a hazard ratio of 1 means the same thing as a risk ratio of 1, but after a multivariate analysis. So if 1 is part of the 95% confidence interval, it's within reason to think that if you repeated this trial, you might find no difference. That means the result is not statistically significant at the 95% confidence level, which is the standard one. This holds whether you're trying to show a lower risk, in which case the entire interval should stay below 1, or an increased risk, in which case the entire interval should stay above 1.
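That quick and dirty check can be written down in a few lines; here's a sketch (the function name is mine, and the numbers simply restate examples from this column):

# Quick-and-dirty significance check for a risk, odds or hazard ratio:
# if the 95% confidence interval contains 1, the finding is not
# statistically significant at the usual 5% level.
def crosses_one(ci_low: float, ci_high: float) -> bool:
    return ci_low <= 1 <= ci_high

print(crosses_one(1.4, 1.8))   # "RR, 1.6; 95% CI, 1.4-1.8": False -> significant
print(crosses_one(0.9, 5.5))   # the celecoxib example below: True -> not significant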

Again, using the raloxifene/MORE study, women taking raloxifene 60 mg per day had a 68% decrease in the risk of new clinical vertebral fractures, and the 95% confidence interval for that reduction was reported as 20% to 87%. To put those numbers on the ratio scale, a 68% decrease is a risk ratio of 0.32 (that is, 1 minus 0.68), and the 20% and 87% reductions work out to risk ratios of 0.80 and 0.13. So, yes, this finding was statistically significant at the 95% level; the interval ran from 0.13 to 0.80 and didn't cross 1.

Let's rip an example from the headlines, as it were. In a study of celecoxib (Celebrex, Pfizer) and heart disease among patients in a clinical trial of colorectal adenoma prevention, subjects reached "a composite cardiovascular end point of death from cardiovascular causes, myocardial infarction, stroke, or heart failure" 1.0% of the time in the placebo group and 2.3% of the time in the treatment group; the latter received 200 mg of celecoxib bid (N Engl J Med e-publication ahead of print, 10.1056/NEJMoa050405). This gave a hazard ratio of 2.3, with a 95% confidence interval of 0.9 to 5.5. See something about those numbers? Yup, the interval crosses 1. So, at the 95% confidence level, the finding is not statistically significant. Again--although that isn't what happened here--researchers may say such a result showed a "trend" toward significance.

And, just as you may see P values lower than 0.05, you may see confidence levels higher than 95%. Again, by convention, 95% is usually good enough, but sometimes, just to prove something closer to beyond a shadow of a doubt, studies will report 97% or 99% intervals.

I hope these 5% rules have burned themselves into your brain. If they haven't yet, perhaps you're an outlier--or, more likely, I'm an outlier when it comes to explaining these things clearly. Either way, better that they end up clear.

Ask the Expert

Deciphering P Values

Dear Dr. Oransky:

Your "Statistically Speaking" column in Pharmacy Practice News is wonderful! As a pharmacist still trying to learn about statistics, I find your information to be simple and easy to understand. I hope that you continue to put out more articles. A quick question:

I have seen data where a relative risk will have an associated P value, as shown in the example below, which looks at the higher risk of hyperglycemia when patients are on steroids versus when they are not. How and where did they come up with the P value?

John Vu, PharmD
Pharmacy Practice Resident
Methodist Dallas Medical Center
JohnVu@mhd.com

Dr. Oransky responds:

Thanks so much for reading, and for the kind words! I've noticed that P values seem to be creeping into evaluations of risk, hazard and odds ratios. It's as if people don't think confidence intervals are enough, although I have yet to see a case in which the confidence intervals do not cross 1 and yet the P value is greater than 0.05.

I consulted Ted Holford, a Professor of Biostatistics at Yale University and author of Multivariate Methods in Epidemiology, and here's what he said about P values and risk ratios: What the P value tells you in these cases is the probability of observing the data if the null hypothesis--eg, that the relative risk is equal to 1--is true, in much the same way as testing the null hypothesis that two means are equal for a normally distributed population.

How they calculate it is probably the same way they calculate most statistics--that is, they give it to a statistician or they throw all the data into a statistics program.

Study Group    Fingerstick       Fingersticks resulting    Unadjusted relative    P value
               blood sugars      in hyperglycemia          risk (95% CI)
Steroids       235               149                       1.57 (1.38-1.77)       <0.0001
No steroids    785               317                       NA                     NA
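For readers who want to peek inside that statistics program, here is a rough sketch (again in Python) of the sort of calculation it performs for a 2x2 table like the one above. It uses the standard normal approximation on the log scale of the relative risk; the original analysis may have used a slightly different method, so treat this as an illustration rather than a re-analysis.

# Rough sketch: relative risk, 95% CI and P value from the 2x2 counts above,
# using the usual normal approximation on the log(relative risk) scale.
import math
from scipy.stats import norm

# From the table: fingersticks showing hyperglycemia / total fingersticks.
events_steroids, total_steroids = 149, 235
events_none, total_none = 317, 785

risk_steroids = events_steroids / total_steroids
risk_none = events_none / total_none
rr = risk_steroids / risk_none                       # about 1.57

# Standard error of log(RR), then a 95% CI and a two-sided P value against
# the null hypothesis that the true relative risk is 1.
se_log_rr = math.sqrt(1/events_steroids - 1/total_steroids
                      + 1/events_none - 1/total_none)
log_rr = math.log(rr)
ci_low = math.exp(log_rr - 1.96 * se_log_rr)
ci_high = math.exp(log_rr + 1.96 * se_log_rr)
p_value = 2 * (1 - norm.cdf(abs(log_rr) / se_log_rr))

print(f"RR = {rr:.2f}, 95% CI {ci_low:.2f}-{ci_high:.2f}, P = {p_value:.1g}")
# Comes out close to the reported 1.57 (1.38-1.77), P < 0.0001.

Note that the null hypothesis here is that the true relative risk is 1, which is why the P value and the "does the confidence interval cross 1" check almost always agree.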