Low p-values are good for you (or at least unavoidable)

Why at least 5% of p-values are ≤5% (even if the null hypothesis is true)

Imagine that 1001 laboratories – L1, L2, ..., L1001 – are enlisted in a phony study, ostensibly to test the mutagenic strength of a new chemical. Each laboratory is to test the chemical using the same standard protocol, by spreading a fixed amount of diluted treated bacteria on several Petri dishes, and counting the total number of mutant colonies.

(Null) hypothesis = "The tested chemical has no effect."

Test statistic = # of mutant colonies on several Petri dishes

The study is phony because each laboratory assumes (perhaps because they are told) that they, and only they, are testing the new mutagen; all the other laboratories are merely controls, testing water. In reality, every laboratory is testing inert, healthy water.

Nonetheless, there is a certain rate of mutations even among control colonies, and it has a random element, so some labs will report more mutations than others. At the end of the experiment, put yourself in the shoes of laboratory L1 and ask what the p-value for the null hypothesis is.

Lab L1 came up with some test statistic value – say 7942.

P-value means – What is the chance of observing a test statistic this high or higher, if the null hypothesis is really true?

Restated: What is the chance of observing 7942 or more mutant colonies if the chemical has no effect?

Since water has no effect, and "chance" means "probability", that is the same as asking:

What is the probability of observing 7942 or more mutant colonies when the bacteria are treated with water?

That is: What is the probability of observing 7942 or more mutant colonies, given exactly the experiment that was performed 1000 times, by laboratories L2, L3, ..., L1001? In other words, from L1's point of view, the other labs' purpose is to calibrate the test statistic.

Since "probability" means long-run frequency given repeated trials, and each laboratory's work can be regarded as a trial, that's essentially the same as asking: What % of the 1000 test statistics reported are greater than or equal to 7942?(1)

Imagine that we arrange all the test statistics in order of size, and assign each an ordinal position counting from 0 = largest:
Lab:              L211   L592    L88   ...     L1   ...    L916    L147     L18    L666
Test stat:           7    120    249   ...   7942   ...   12122   14229   21000   92929
Ordinal position: 1000    999    998   ...    108   ...       3       2       1       0
p-value:           p=1   .999   .998   ...   .108   ...    .003    .002    .001     p=0

In this way we get an empirical estimate p=0.108 as the p-value for L1's score. It just means that L1's score lies at the 10.8%ile mark (counting from 0=largest), among a large set of "control" scores (scores obtained or expected assuming the null hypothesis).
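The same bookkeeping can be sketched in code (the counts are simulated stand-ins, under the same assumed Poisson null as above; ties are handled by the footnote's "greater or equal" rule):

    import numpy as np

    rng = np.random.default_rng(1)
    stats = rng.poisson(lam=8000, size=1001)  # hypothetical counts, L1 ... L1001

    # Ordinal position counting from 0 = largest: the number of OTHER labs
    # whose statistic is greater than or equal to this lab's (footnote 1).
    positions = np.array([(stats >= s).sum() - 1 for s in stats])

    # p-value = ordinal position / 1000 other labs, so p runs 0, .001, ..., 1.
    p_values = positions / 1000

Up to ties, sorting stats and reading off p_values reproduces the table above: the largest statistic sits at p=0, the smallest at p=1, and each lab's p-value is just its percentile position among the controls.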

But what goes for L1 goes for every other lab as well. Remember, they all used water. Each of those labs ends up with a p-value corresponding to its ordinal position.

Five percent of the labs occupy the extreme right 5% of the picture, and all of these labs therefore necessarily have a p-value ≤ 5%. So, assuming the null hypothesis, a p-value ≤ 5% occurs 5% of the time – which is what was to be shown. (If the null hypothesis is false, then small p-values occur even more often.)

Of course, what is true for 5% is equally true for any other number. When the null hypothesis is true, the % of labs that will report a p-value ≤ x is exactly x, for any probability x. Or in symbols,

Pr(p-value ≤ x | null hypothesis) = x.
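The identity is easy to check by simulation (a sketch under the same assumed Poisson null as before; ties among the counts make the agreement approximate rather than exact):

    import numpy as np

    rng = np.random.default_rng(2)
    stats = rng.poisson(lam=8000, size=1001)  # 1001 labs, all testing water

    # Each lab's empirical p-value against the other 1000 labs.
    p_values = np.array([(np.delete(stats, i) >= stats[i]).mean()
                         for i in range(len(stats))])

    # Under the null, the fraction of labs with p-value <= x should be about x.
    for x in (0.05, 0.10, 0.50):
        print(f"fraction of labs with p <= {x:.2f}: {(p_values <= x).mean():.3f}")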

1. To avoid ascertainment bias, we shouldn't count the lab itself as one of the labs with a "greater or equal test statistic."

