How Bayes saves you from p-hacking

25 Mar 2014


Simmons, Nelson and Simonsohn (2011) write in their false-positive psychology paper:

"Although the Bayesian approach has many virtues, it actually increases researcher degrees of freedom. First, it offers a new set of analyses (in addition to all frequentist ones) that authors could flexibly try out on their data. Second, Bayesian statistics require making additional judgments (e.g., the prior distribution) on a case-by-case basis, providing yet more researcher degrees of freedom."

More recently Simonsohn shows

that the two Bayesian approaches that have been proposed within Psychology, Bayesian inference and Bayes factors, are as invalidated by selective reporting as p-values are.

In the former case he shows that the posterior estimate of a difference between two experimental conditions is very similar to the frequentist estimate. These estimates, however, were obtained from data where a condition was dropped from the analysis and the analyst chose one of many possible covariates. Presumably, he did so with the goal of obtaining a 95% posterior interval that does not include a zero difference.

My response:

  1. There is nothing wrong with the bayesian analysis. It gives you an estimate conditional on the data. If you feed in a biased sample, you will obtain a biased estimate. Garbage in, garbage out.

  2. Viewed from another angle, you haven't included all your prior knowledge in the analysis. The information about how you 'post-processed' your data should be included in the analysis. In fact, a well-informed causal analysis may allow you to recover an unbiased estimate even from a biased sample. In any case, leaving relevant prior knowledge out of the analysis is not very bayesian. It's not bayesian at all!

  3. The bayesian estimation approach tells you nothing about the decision you should make. In particular, it does not tell you whether you should accept or reject a hypothesis. The goal of parameter estimation is to accurately estimate the magnitude of a quantity of interest. The researcher then decides to focus on zero effect as $H_0$ (why not take a region of interest?), decides to report the 95% posterior interval (why not 99% or 90%?) and decides to compare the two. These are additional decisions that go beyond parameter estimation. The problem arises because of the binary hypothesis-testing filter that is applied to the posterior estimate. If precision is the goal of the estimation approach, we may imagine a parallel universe or a future world where precision is the target of hacking. Even if we accept that precision-hacking may be a problem (of which I'm not so sure), p-hacking is the current problem, and there is currently no corresponding incentive that would make precision-hacking attractive. A solution to p-hacking is then to avoid p-values (and hypothesis testing more generally).

  4. A frequentist may come up with a defense analogous to my points 1 and 3. This only shows, he argues, that p-hacking is a problem of a) model comparison/hypothesis testing, b) incentives that bias scientists towards obtaining significant results and c) researcher degrees of freedom. Indeed, p-hacking concerns the bayesian/frequentist distinction only indirectly: a) In bayesian analysis parameter estimation is preferred to model comparison, as evidenced by recent bayesian textbooks (Gelman et al., 2003; Kruschke, 2010), and it is certainly more common there than in frequentist analyses. b) As I wrote in point 3, the incentives target model comparison and p-values in particular. As a consequence, bayesian estimation is the least likely to be affected by the faulty incentives. c) Bayesian analysis either decreases the researcher degrees of freedom or makes them explicit, transparent and accountable.
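To make point 3 concrete, here is a minimal sketch of how the verdict "the interval excludes zero" depends on decisions that are not part of the estimation itself. The posterior below is hypothetical: I simply assume that the posterior for the difference is approximately normal with mean 0.21 and standard deviation 0.10. The region of interest of ±0.05 is an arbitrary illustration as well.

```python
from statistics import NormalDist

# Hypothetical posterior for the difference between two conditions
# (assumed to be approximately normal; the numbers are made up).
posterior = NormalDist(mu=0.21, sigma=0.10)


def central_interval(dist, mass):
    """Return the central posterior interval that contains `mass` probability."""
    tail = (1.0 - mass) / 2.0
    return dist.inv_cdf(tail), dist.inv_cdf(1.0 - tail)


def excludes_zero(interval):
    """True if zero lies outside the interval."""
    lower, upper = interval
    return lower > 0.0 or upper < 0.0


# The same posterior, filtered through three different interval widths.
verdicts = {
    mass: excludes_zero(central_interval(posterior, mass))
    for mass in (0.90, 0.95, 0.99)
}

# Alternative to a point null: posterior mass inside a region of interest.
region = (-0.05, 0.05)
mass_in_region = posterior.cdf(region[1]) - posterior.cdf(region[0])

for mass, verdict in verdicts.items():
    print(f"{mass:.0%} interval excludes zero: {verdict}")
print(f"posterior mass inside the region of interest: {mass_in_region:.3f}")
```

The posterior is identical in all cases. Only the filter applied to it changes, and with it the binary verdict.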

I want to elaborate on c). First, let's get rid of the reservations mentioned by Simmons et al.

"it offers a new set of analyses (in addition to all frequentist ones) that authors could flexibly try out on their data"

This is similar to observing that by increasing the number of physicians we increase the number of patients who die due to mistreatment by a physician, and then concluding that we should decrease the number of physicians. Of course, we could decrease the researcher degrees of freedom (DOF) by allowing only a single analysis method. But this trades fewer DOF for an inflexible analysis that gives incorrect results, which is clearly unacceptable.

"Bayesian statistics require making additional judgments (e.g., the prior distribution) on a case-by-case basis"

I refuse to accept that the choice of prior distribution increases the researcher's DOF. Not in modern bayesian analysis and not in my experience with fitting models to data. Even if there were DOF in the choice of priors, their definition is part of the model formulation and as such is explicit, replicable and accountable (unlike omitting covariates or measures from the analysis).
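One way to check this on your own data is to refit the model under several priors and compare the results. The following is only a sketch under strong assumptions: a normal model with known noise standard deviation, simulated data, and three priors that I picked for illustration. With a conjugate normal prior the posterior is available in closed form.

```python
import random
from statistics import fmean


def normal_posterior(data, noise_sd, prior_mean, prior_sd):
    """Posterior mean and sd of a normal mean with known noise sd (conjugate update)."""
    prior_precision = 1.0 / prior_sd ** 2
    data_precision = len(data) / noise_sd ** 2
    post_precision = prior_precision + data_precision
    post_mean = (prior_precision * prior_mean + data_precision * fmean(data)) / post_precision
    return post_mean, post_precision ** -0.5


# Simulated data set (hypothetical): 200 observations, true mean 0.3, noise sd 1.
rng = random.Random(1)
data = [rng.gauss(0.3, 1.0) for _ in range(200)]

# Three illustrative priors, given as (mean, sd).
priors = {"N(0, 1)": (0.0, 1.0), "N(0, 10)": (0.0, 10.0), "N(0.5, 2)": (0.5, 2.0)}
posteriors = {
    name: normal_posterior(data, 1.0, mean, sd) for name, (mean, sd) in priors.items()
}

# How far apart are the posterior means compared to the posterior uncertainty?
means = [post_mean for post_mean, _ in posteriors.values()]
spread = max(means) - min(means)

for name, (post_mean, post_sd) in posteriors.items():
    print(f"prior {name}: posterior mean {post_mean:.3f}, sd {post_sd:.3f}")
print(f"spread of posterior means: {spread:.4f}")
```

In this setting the posterior means differ by far less than the posterior standard deviation. A much narrower prior would of course shift the estimate, but it would be written down in the model for everyone to see.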

My main message, however, is that bayesian analysis actually decreases the DOF in the analysis. This happens because the standard frequentist analysis used in psychological research is inflexible and breaks down on anything more complex than a toy data set. This requires ad-hoc fixes, which introduce decisions and DOF. Here are some examples:

  1. What about missing values? Should we exclude all cases where at least one input is missing? Or should we exclude only cases where predictors are missing? Another approach is to insert the sample mean or some other descriptive statistic for the missing values. In the bayesian approach these choices do not arise. The missing values are estimated along with the other parameters.

  2. A related problem concerns censored measurements. For instance, response times are usually constrained by an upper threshold. If the response takes too long, we terminate the trial and continue with the next one. In such a case the exact value remains unknown. All we know is that the value is higher than the threshold. How should we handle censored data? One approach is to exclude them from the response time analysis and model the omissions separately. Or we may insert the threshold value for each censored value. One may also decide to insert random values (that are higher than the threshold) or an expected value. In both of the latter cases we need to separately model the tail of the distribution above the threshold. In bayesian analysis we simply include the tail probability $p(x>\theta)$ as a factor in the likelihood function for each censored trial. The resulting parameter estimates incorporate the information provided by the censored data.

  3. Multiple comparisons deserve an honorable mention. There are various proposals for how to correct the nominal $\alpha$ when doing multiple comparisons. All these corrections trade a decrease in power for fewer false positives. Bayesian hierarchical modeling instead adjusts the magnitude of the estimate (= effect size) directly. As a consequence, the problem with multiple comparisons does not arise.
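The censoring case from example 2 can be sketched in a few lines. This is an illustration under assumptions of my choosing, not a realistic response time model: response times are exponential with an unknown rate, the rate has a Gamma(1, 1) prior, and the data are simulated. Each observed time $x$ contributes its density to the likelihood and each censored trial contributes the tail probability $p(x>\theta)$. With these choices the posterior is again a Gamma distribution, so no sampler is needed.

```python
import random


def simulate_trials(true_rate, threshold, n_trials, seed=0):
    """Simulate exponential response times; trials longer than threshold are censored."""
    rng = random.Random(seed)
    times = [rng.expovariate(true_rate) for _ in range(n_trials)]
    observed = [t for t in times if t <= threshold]
    return observed, n_trials - len(observed)


def gamma_posterior(observed, n_censored, threshold, prior_shape=1.0, prior_rate=1.0):
    """Gamma posterior (shape, rate) for the exponential rate.

    An observed time x contributes the density rate * exp(-rate * x).
    A censored trial contributes p(x > threshold) = exp(-rate * threshold).
    """
    shape = prior_shape + len(observed)
    rate = prior_rate + sum(observed) + n_censored * threshold
    return shape, rate


threshold = 0.5
observed, n_censored = simulate_trials(true_rate=2.0, threshold=threshold, n_trials=2000)

# Censoring handled inside the likelihood (posterior mean = shape / rate).
shape, rate = gamma_posterior(observed, n_censored, threshold)
estimate_with_censoring = shape / rate

# Ad-hoc fix 1: drop the censored trials.
shape_d, rate_d = gamma_posterior(observed, 0, threshold)
estimate_dropping = shape_d / rate_d

# Ad-hoc fix 2: pretend the censored trials ended exactly at the threshold.
substituted = observed + [threshold] * n_censored
shape_s, rate_s = gamma_posterior(substituted, 0, threshold)
estimate_substituting = shape_s / rate_s

print(f"censored trials: {n_censored} of 2000")
print(f"censoring in the likelihood: {estimate_with_censoring:.2f}")
print(f"dropping censored trials:    {estimate_dropping:.2f}")
print(f"substituting the threshold:  {estimate_substituting:.2f}")
```

In this simulation the true rate is 2. The estimate that keeps the censored trials in the likelihood lands close to it, while both ad-hoc fixes overestimate the rate.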

Bayesian analysis is more flexible than the frequentist methods. We can and should take advantage of this flexibility. The first recommendation in Gelman and Hill (2006) is to fit many models. After several iterations of accepting, rejecting and extending models, I sometimes wonder whether my expectations influenced my model selection process. I really can't rule that out. But a) sticking to an invalid model because it was my first choice is not the solution and b) we can document the choices so that other researchers can have a look at them and point out any aspects that have been neglected. The documentation can be done easily with IPython Notebooks. Oh, did I mention that this blog is written in IPython Notebook?!
