Hypothesis testing - A visual testimony to its failure

12 Jun 2014


It should be clear from my previous posts that I'm not a fan of hypothesis testing (which is sometimes called model comparison in the Machine Learning and Bayesian Statistics). Once we settled on the detective approach to data analysis - and this is at least in principle not controversial - then the next fork where most researchers go astray is the choice between hypothesis testing (HT) and parameter estimation (PE). PE has been advocated by the new statistics movement and in principle should not be controversial either. Suprisingly, the recent advocates (here and here) come from the bayesian corner where PE has already been dominant. These advocates offer three arguments:

a) HT (in terms of falsifying existing theories) features prominently in Popper's philosophy of science. You don't want to stage an attack on Popper, do you?! Anyway, there doesn't seem to be any alternative to Popper at the moment so - for better or worse - we are stuck with HT.

b) Their argument is actually for a mixture of HT and PE (all that matters is that it's bayesian). So their argument is that more is better. The radical position here is mine, because I'm claiming that HT should be abandoned.

c) The failure of HT so far is due to frequentist statistics. If we use bayesian model comparison instead, these problems go away.

The frequentist apologetes may add that HT fails because it is not applied adequately. In psychology

d) questionable research practices are prevalent e) replication efforts are rare f) null results are not reported/published

a) will require some more discussion and I'm already tackling it in the parallel series of posts.

b) can be handled quickly. Note that the transition between PE and HT is gradual.

When the hypotheses become very numerous, a different approach [rather than hypothesis testing] seems called for. A set of discrete hypotheses can always be classified by assigning one or more numerical indices which identify them, as in $H_t, 1<t<n$ and if the hypotheses are very numerous one can hardly avoid doing this. Then deciding between the hypotheses $H_t$ and estimating the index $t$ are practically the same thing, and it is a small step to regard the index, rather than the hypotheses, as the quantity of interest; then we are doing parameter estimation. (Jaynes,p. 149)

As a direct consequence, PE and HT overlap in practice. If I know that 95% interval of the parameter estimate is [0.1,0.2] then I know that the estimate is significantly different from zero (at $\alpha = 0.05$, undirected test). Note however, that the overlap is only partial. I can't recover the effect magnitude and its percentile interval from a statement about it's significance. If HT is a reduced version of PE, then someone who advocates HT is asking us to constrain our inference rather than expand it.

c,d,e and f are irrelevant with respect to the failure of HT. This can be easily shown after some thought. I illustrate with an exercise in image processing.

In a distant universe the reality/truth looks like this:

In [1]:
%pylab inline
from scipy.stats import scoreatpercentile as sap
import urllib2 as urllib			
from PIL import Image
from cStringIO import StringIO			
imgfile = urllib.urlopen('')
im = StringIO(
I =
def showImage(I):
Populating the interactive namespace from numpy and matplotlib

Reality is an ananas and the beings in this reality wish to find out what reality looks like. They set up a research institute to find out. The research institute is subdivided into departments. Each department consists of specialized groups which consist of scientists. At the bottom of the hierarchy each scientist studies a single pixel. The pixel measurements are noisy. Each scientist gathers multiple measurements and tries to figure out what the picture looks like. We create the data.

In [2]:
N=30 # number of observations
for i in range(N):

PE works like breeze. We estimate the intensity at each pixel and along with the CI.

In [3]:
In [4]:
In [5]:

Reality is ananas.

Let's try HT. H0 says that all pixel values are equal. H1 says that they are different. For simplicity we use z-test. The result of HT looks like this.

In [6]:

The white pixels are significantly different different from the expected value of random proces (0.5). In the case of the black pixels the test was inconclusive. For the benefit of HT, let's say our test was bidirectional. Then our test shows the following picture.

In [7]:

Not bad. Note however that the gray pixels are not gray but rather indicate a question mark. With enough data the gray pixels disappear and the result of HT looks like this.

In [8]:

This is the black-and-white version of the PE estimate. It's difficult to recognize the ananas.

The usual response would be to run subsequent hypothesis tests with more specific hypotheses that would tell us what kind of shade of gray the pixels are. It should be clear that such an effort is a waste of data. We already saw that we do have enough data to make a reasonable guess with PE.

Another option would be to test multiple hypotheses from the outset - say ten hypotheses with equally spaced range over the whole gray scale. In the limit this approach reduces to PE.

Note that replication, bayes factors or power calculation do not help us a bit here. The problem is the binary filter. The reality is not black and white and HT and it's binary filter does not work.

Now here is the killer argument from the HT advocate. Researcher A in our ananas-reality comes up with the alternative hypothesis that the reality looks like is this:

In [9]:

Spot on, he gets every single pixel value right with a float-point precision!

He tests this hypothesis against the null that all pixels are equal. We use F-test.

In [10]:
from scipy import stats
print 'F = %.3f, p = %.4f'%(F,1-stats.f.cdf(F,1,N-1))
F = 22.426, p = 0.0001

Super-significant. Great.

However what happens if another researcher B comes up with another theory-of-everything which happens to be random noise. He tests against the same $H_0$

In [11]:
print 'F = %.3f, p = %.4f'%(F,1-stats.f.cdf(F,1,N-1))
F = 9.908, p = 0.0038