With pre-registered fully bayesian analyses against the pre-registered replications

21 Nov 2016


In a note on his blog, John Kruschke asks whether it is legitimate to select a model based on data and then use the data again to obtain parameter estimates for the selected model. By model we mean a functional specification of the relationship between the variables of interest. The specification may include simple statements about a marginal distribution of a variable, a functional relationship between multiple variables or a set of equations that describe something as complex as the movement of planets. Using data twice is suspicious, because:

  1. In bayesian context the prior should be constructed based on prior knowledge and should be independent of data which enter via likelihood. By using data twice, the analysis may overestimate the precision of the posterior parameter estimates. On the other hand there is the saying that today's posterior is tomorrow's prior. So perhaps this case could be reformulated as a valid repeated application of Bayes theorem on the same data set?

  2. From the perspective of hypothesis testing, the analyst would conduct multiple tests so perhaps she should adjust the $\alpha$ level of each test so that the overall error rate is not altered by splitting the analysis into two separate tests. But perhaps this is not a problem for bayesian analysis which is not affected by the multiple comparison issues if done properly (Gelman, Hill & Yajima, 2012)

  3. Often the model selection step is informal, intransparent, not documented and hence not replicable. Hence the two-step procedure can be considered a questionable research practice. But perhaps we could just make the model selection more formal or at least document it in order to make it scientifically defendable?

So, is it legitimate to select a model based on data and then use the data again to obtain parameter estimates for the selected model? A short and theory-based answer is that the two-step procedure is not an issue if done properly. Formally, we first estimate probability of a model $M$ given data $D$ and then estimate the probability of parameters $\Theta$ given data by marginalizing over models.

$$p(\Theta|D)=\sum_m p(\Theta|D,M=m)p(M=m|D)$$

The practical problems arise because $p(\Theta|D)$ is usually not estimated properly - in accordance with the probability theory. Researchers instead compute say


where $m_\mathrm{MAP}$ is the model with highest probability given the data. Effectively we sum over the model space but we assign a prior with $p(M=m_\mathrm{MAP}|D)=\infty$ and $p(M\neq m_\mathrm{MAP}|D)=0$. Such prior knowledge state is implausible and the posterior parameter estimate will be inaccurate if p(M\neq m_\mathrm{MAP}|D)>0$. The latter procedure is sometimes referred to as Maximum posterior (MAP) solution while the former solution is called fully bayesian (see Bishop, 2006, p. 30).

Why would anyone use the inaccurate MAP solution? Kruschke points out some possibilities:

  1. The fully bayesian procedure may be computationally intractable, while MAP may be feasible. In these cases, we apparently need to decide between no analysis and potentially inaccurate analysis, so we go with MAP.

  2. In some applications of probability theory such as in machine learning, one does not use models for causal discovery and hence the model is not required to reflect how nature works. Instead the models are used to obtain accurate prediction. Naive bayes algorithm implemented by spam filters is a good example of an implausible model that provides good predictions.

However, in psychology we are almost always interested in causal discovery so we can scratch (2). Unlike Kruschke I don't think (1) is relevant. Perhaps the analytic solution is intractable but the fully bayesian estimates can be obtained with approximate procedures such as Markov Chain Monte Carlo. Indeed I think the modern MCMC algorithms (HMC, NUTS) are powerful enough to accommodate almost all of the modeling ideas a psychology researcher may come up with. In my research I always attempt such fully bayesian solution. In my experience the solution is computationally tractable, however it poses an entirely different challenge.

The challenge is that the data were obtained from an experiment that was designed to estimate some specific parameter (location, shape parameter, or ). Or even worse the experiment was designed for hypothesis testing - say to determine ordinal relationship between outcome $y_1$ (experimental group) and $y_0$ (control group) for manipulation at $X=x_1$ and $X=x_0$ respectively. The experiment was not designed to estimate the parametric function $Y=f(X,\Theta)$. In fact, in the extreme case of hypothesis testing with two groups, a function $f$ with two parameters is not identifiable because any function with two degrees of freedom (say offset and slope) will be able to fit the two points $(y_1,x_1)$ and $(y_0,x_0)$. If the data tell us little about what models are plausible then $p(M|D)=p(M)$. If you select a wide prior $p(M)$, as you should, then you will be summing over a wide prior. As a consequence your parameter estimates will be all over the place. Researchers will not be able to publish inaccurate estimates and so he learns to avoid fully bayesian solutions altogether. By using MAP the researcher potentially overestimates the precision and fools himself and his academic readership.

Ultimately, however we wish to obtain, publish and read about precise estimates. So what is the solution? John Kruschke recommends (pre-registered) replications. In general, throwing more data at imprecise estimates can't do any harm. However, because the original experiment was not designed to determine the functional specification, additional data will provide little information about the functional relationship and the replication will be effectively a waste of resources. In fact, as Rotello, Heit & Dube (2015) point out that replications in such context may provide false sense of validation and do more harm than good. Their point was that typical applications of signal detection theory can't derive the precise shape of the ROC curve, because they do not systematically wary the difficulty or the decision bias. This in turn constraints the validity of the conclusions obtained from such experiments.

The solution is to design specific experiments to precisely determine $p(M=m|D)$. One may wonder what kind of beast are studies that achieve such thing. These are actually rather common in personality psychology or education research. These are validation studies that serve to validate questionnaires or interventions. Currently, the core problem of psychological research, especially in social psychology, is that it uses unvalidated measures and unvalidated manipulations. If a researcher doesn't know what is being measured and how the manipulation affects the outcome, probability theory can't help him obtain precise estimates. All that probability theory can do via the fully bayesian solution is to veridically quantify the uncertainty and to point the researcher to the source of the uncertainty.

comments powered by Disqus