Here is a list of problems that arise if Anova is used as the default tool to analyze designs with multiple variables.
1. Anova is a tool for model comparison, which doesn't tell us as much as parameter estimation could.
2. Variance, and indeed any quantity on a quadratic scale, is difficult to interpret.
Usually we do not only wish to find out whether there is a main effect or an interaction. Researchers also want to test whether their theory predicts the direction of the effect correctly. Anova is therefore standardly supplemented by post-hoc comparisons. This raises additional issues, all of which boil down to the fact that these comparisons use raw cell averages, which do not provide appropriate estimates. For instance, consider the following scenario. We have a factor with ten levels, and nine out of ten cells have averages in $[-1,1]$. The tenth average is $10$. If the sample size per cell is small we should distrust the estimate of the tenth average. Instead we would locate its true value closer to the band $[-1,1]$, or at least somewhere in the range $[1,10]$. This can be done more systematically by partially pooling the cell estimates with a hierarchical prior (or by using a random-effects model in the frequentist framework); a sketch of such a model is given at the end of this section. What happens if one uses the raw averages instead?
3. The differences between the raw averages overestimate the true differences. As a consequence one will obtain more significant differences than one would obtain with accurate estimates. This issue is more commonly framed in terms of multiple comparisons. If one makes ten independent comparisons at a nominal $\alpha=0.05$, the probability of obtaining at least one significant result is $1-0.95^{10}\approx 0.40$ (see the first calculation after this list). To make the results independent of the number of comparisons, researchers have proposed various corrections that adjust the nominal $\alpha$. This does not, however, tackle the core of the problem, namely that the magnitude of the difference is overestimated.
4. An alternative approach is to use $p$ values (instead of computing differences directly) to compare conditions across experiments and across studies. Since the $p$ values depend on the $H_0$, and the $H_0$ varies between experiments and studies, this is in general not possible. Even if the $H_0$ are comparable (as is the case when zero difference is used as the default $H_0$), we often see the following claim: if the DV differs significantly from zero in one study but not in the other, then the DV is assumed to differ significantly between the two studies. This is a non sequitur. Consider a $z$ score of $1.9$ in study A and of $2.0$ in study B. Then $P(z>2.0)=0.023$ and $P(z>1.9)=0.029$. Assuming an undirected alternative hypothesis and $\alpha=0.05$, we obtain a significant result in study B but not in study A. By the above logic, performance in study B would be declared significantly different from performance in study A. However, if we compare the two effects directly, the difference in $z$ scores is only $2.0-1.9=0.1$; even if we (too generously) treat this difference as a $z$ score itself, $P(z>0.1)=0.46$, and accounting for the standard error of a difference ($\sqrt{2}$) weakens the evidence further (the numbers are reproduced in code after this list). The fallacy is easy to spot when we compare two levels, but it becomes obscured when a variable has more levels and the data are analyzed with Anova.
5. Classical Anova assumes a balanced design; unbalanced designs require additional fixes, much as violations of sphericity in repeated-measures designs require corrections. In an unbalanced design the standard error of the cell-mean estimate varies between cells, and post-hoc comparisons of raw averages ignore this.
6. Another requirement, critical for post-hoc comparisons, is that the variance does not differ across cells. This assumption does not always hold, yet it is rarely checked in the published literature.
7. Multiple comparisons are not only a problem for post-hoc analyses. A separate test is performed for each term in the Anova (each main effect and each interaction). The number of tests, and with it the probability of obtaining at least one significant result, increases with the number of variables (plus interactions) included in the Anova (again, see the calculation after this list). According to Maxwell (2004) this may be a reason for the prevalence of underpowered Anova studies: researchers aim for some significant result by default, instead of planning a sample size that provides enough power to detect all effects of interest reliably.
8. Interaction coefficients are estimated with lower precision than main-effect coefficients. As a consequence, interactions require more power, and hence more subjects, to be estimated with the same precision as main effects. In a factorial design, however, the main effects and the interactions are estimated from the same sample. Main effects are therefore more likely to reach significance than interactions, and a hypothesis that predicts significant main effects but no significant interactions is more likely to be "confirmed" than other hypotheses. This advantage can be further exploited by carefully planning the sample size (or, more pragmatically, by optional stopping). The simulation after this list illustrates the gap in power.
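To make the multiple-comparison arithmetic behind points 3 and 7 concrete, here is a minimal calculation; it assumes the tests are independent, which is an idealization.

```python
# Probability of at least one significant result among k independent tests
# at a nominal alpha of 0.05.
alpha = 0.05
for k in (1, 3, 5, 10, 20):
    p_any = 1 - (1 - alpha) ** k
    print(f"{k:2d} comparisons: P(at least one significant) = {p_any:.2f}")
# Ten comparisons already give a probability of about 0.40, and the number
# keeps growing with every additional variable or interaction in the Anova.
```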
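The numbers in point 4 can be verified directly. A short check, assuming the two $z$ scores come from independent studies (scipy's normal distribution supplies the tail probabilities):

```python
from scipy.stats import norm

z_a, z_b = 1.9, 2.0                    # z scores in study A and study B
p_a = 2 * norm.sf(z_a)                 # two-sided p = 0.057 -> not significant
p_b = 2 * norm.sf(z_b)                 # two-sided p = 0.046 -> significant
print(f"study A: p = {p_a:.3f}, study B: p = {p_b:.3f}")

# Comparing the studies directly: the difference in z scores is 0.1 and,
# for independent studies, its standard error is sqrt(2).
z_diff = (z_b - z_a) / 2 ** 0.5
print(f"difference between studies: p = {2 * norm.sf(z_diff):.2f}")  # about 0.94
```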
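Point 8 can be illustrated with a small simulation of my own construction: a balanced 2×2 between-subjects design in which the true main effect of the first factor and the true interaction are equally large ($0.5$), yet the main effect is detected far more often. The error variance is treated as known for simplicity.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
m, sigma, reps = 25, 1.0, 5000          # m subjects per cell

# True cell means: main effect of A = 0.5 (difference of marginal means) and
# A x B interaction = 0.5 (difference of differences).
mu = np.array([[0.0, 0.0],
               [0.25, 0.75]])           # rows: factor A, columns: factor B

se = sigma / np.sqrt(m)                 # standard error of a cell mean
crit = norm.ppf(0.975)                  # two-sided test at alpha = 0.05
hits_main, hits_inter = 0, 0
for _ in range(reps):
    ybar = mu + rng.normal(0, se, size=(2, 2))                      # simulated cell means
    main = ybar[1].mean() - ybar[0].mean()                          # SE = se
    inter = (ybar[1, 1] - ybar[1, 0]) - (ybar[0, 1] - ybar[0, 0])   # SE = 2 * se
    hits_main += abs(main / se) > crit
    hits_inter += abs(inter / (2 * se)) > crit

print(f"power for the main effect : {hits_main / reps:.2f}")   # roughly 0.7
print(f"power for the interaction: {hits_inter / reps:.2f}")   # roughly 0.25
```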
It would certainly be interesting to know the prevalence and the severity of the consequences of the problems listed above; the items surely differ in this respect. For my purposes, however, the list illustrates that Anova is not a tool for testing composite hypotheses. One should use multi-level regression (Gelman and Hill, 2007) instead. In multi-level regression the regression coefficients give us effect size estimates (1) that are directly interpretable in terms of the units of the IV and DV (2). Multi-level models adjust the magnitude of the cell-mean estimates automatically, so no additional corrections are needed (3). The estimates are usually obtained by sampling, and the samples can easily be used to compare quantities between experiments and studies (4). They provide not only the mean estimate but also its uncertainty (e.g. quantiles). Hierarchical models take the differences in sample size and variance across cells into account (5, 6). Problems 7 and 8 do not arise.
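As a concrete illustration, here is a minimal sketch of a hierarchical (partially pooled) model for the ten-cell example above, written with the PyMC and ArviZ libraries. The simulated data, the priors, and the sampler settings are all illustrative assumptions, not a recipe.

```python
import numpy as np
import pymc as pm
import arviz as az

# Simulated data mimicking the example above: ten cells, nine with true means
# in [-1, 1], one with true mean 10, and only five noisy observations per cell.
rng = np.random.default_rng(0)
K, n = 10, 5
true_means = np.concatenate([rng.uniform(-1, 1, K - 1), [10.0]])
cell = np.repeat(np.arange(K), n)
y = true_means[cell] + rng.normal(0, 5, K * n)

with pm.Model():
    grand_mean = pm.Normal("grand_mean", 0, 10)
    tau = pm.HalfNormal("tau", 5)                          # between-cell sd
    theta = pm.Normal("theta", grand_mean, tau, shape=K)   # partially pooled cell means
    sigma = pm.HalfNormal("sigma", 10)                     # within-cell sd
    pm.Normal("obs", theta[cell], sigma, observed=y)
    idata = pm.sample(1000, tune=1000, chains=2, random_seed=0)

# Posterior means and 95% intervals per cell: the extreme cell is shrunk toward
# the grand mean, and every estimate comes with its own uncertainty interval.
print(az.summary(idata, var_names=["theta"], hdi_prob=0.95))
```

The posterior draws returned by the sampler are exactly the samples referred to above: differences between cells, or between experiments, can be computed draw by draw together with their uncertainty, instead of relying on raw averages and post-hoc corrections.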
It is not difficult to see how Anova slipped into its role as the default test for composite hypotheses. Fifty years ago multi-level modeling methods were simply not available. Fortunately they are available now, so let's use them.