No Way Anova - Binning of continuous variables is bad style

15 Feb 2015


Steven Pinker has a new book out. It's a style manual - a book about what authors can do to improve the clarity of their prose. Actually, the book would be more accurately described as guide to style manuals. Unlike, other style manuals Pinker takes stock of his knowledge of cognitive science and linguistics and explains the logic behind various rules that pertain to language's use. Once this logic is explained the reader can apply it to cases that haven't been covered by the book. The book is great and I recommend it to every scientist.

Anyway, I want to discuss one of the examples that Pinker uses to illustrate bad style. As scientist, Pinker takes many examples from academic writing. My favorite is:

Participants read assertions whose veracity was either affirmed or denied by the subsequent presentation of an assessment word.

The so-called nominalizations (assertions, veracity, presentation, assessment) make this sentence difficult to comprehend. Once we remove them we get:

We presented participants with a sentence, followed by the word TRUE or FALSE.

In addition to nominalizations Pinker warns against abstract nouns such as levels, strategies, perspective or prospects. Just like with nominalizations, omitting the abstract nouns makes comprehension easier. Here is an example from Pinker. Abstract nouns are underlined.

The researchers found that groups that are typically associated with low alcoholism levels actually have moderate amounts of alcohol intake yet still have low levels of high intake associated with alcoholism, such as Jews.

Pinker recommends his version with abstract nouns omitted.

The researchers found that in groups with little alcoholism such as Jews, people actually drink moderate amounts of alcohol, but few of them drink too much and become alcoholics.

I want to discuss Pinker's replacement of "low alcoholism level" with "little alcoholism". Where does the "low alcoholism level" come from? It's not difficult to guess. The original measure was the proportion of alcoholism in certain populations. This is a continuous variable. The authors wanted to predict the amount of alcohol from alcoholism level. Instead of running a GLM regression, the researchers binned populations into groups according to alcoholism. Each group then describes (discrete) alcoholism level. I agree with Pinker that "alcoholism level" and indeed any case of discretized variable is a monstrosity and should be avoided. But the problem does not go away, if we just rewrite the sentence. Pinker's revision makes a different claim than the original text. What Pinker highlights is that the author's claim is of little interest since no one thinks of alcoholism as a discrete variable. At the same time, it is easy to see that the claim, that we are interested in, concerns the continuous variable. As such this claim can't be distilled with Anova. Something has to give. Psychologists give up their claims and their research questions. I say, they should abandon Anova.

comments powered by Disqus