Sometimes we categorise continuous data rather sloppily. For example, it’s common to judge the scope of tasks as “small”, “medium”, or “large” rather than by the estimated time itself. Any conclusions we draw from categorised data are greatly affected by this practice, but we rarely examine exactly what the effects are before we choose to categorise. There’s a brief paper that looks into the prevalence of this practice and gives some (common-sense) recommendations1 Categorisation of continuous risk factors in epidemiological publications; Turner, Dobson, & Pocock; Epidemiological Perspectives & Innovations; 2010. Available online..
- Ideally, use categories that are well defined and have been proven to be clinically useful2 This is the case with bmi, for example..
- In the absence of prior art, try multiple ways of categorising the data, but don’t use this to p-hack3 I.e. don’t try multiple categorisations until you find one that generates the outcome you want and then hide how you got your desired outcome.. When we report our results, we should disclose all the ways of categorising the data that we tried but didn’t end up using.
- Regardless of how many or few categories we choose, we should display numerical results both in a table and visually in a plot.
And, perhaps the one that was most surprising to me:
- Dichotomisation4 Splitting values into two categories: those above and those below a threshold. loses us statistical power equivalent to throwing a third of the data in the rubbish bin!
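We can see this effect in a quick simulation. This is my sketch, not taken from the paper: it assumes a normally distributed predictor dichotomised at its median, in which case theory says the correlation with the outcome shrinks by a factor of √(2/π) ≈ 0.80, and since power tracks the squared correlation, that is equivalent to keeping only 2/π ≈ 64 % of the data.

```python
# Simulate correlated (x, y) pairs, then compare the Pearson correlation
# computed from the continuous predictor against the one computed after
# dichotomising the predictor at its median (zero, for a standard normal).
import math
import random
import statistics

random.seed(0)
n = 200_000
rho = 0.5  # true correlation between predictor x and outcome y

xs, ys = [], []
for _ in range(n):
    x = random.gauss(0, 1)
    # Construct y so that corr(x, y) = rho.
    y = rho * x + math.sqrt(1 - rho**2) * random.gauss(0, 1)
    xs.append(x)
    ys.append(y)

def pearson(a, b):
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    return cov / math.sqrt(
        sum((u - ma) ** 2 for u in a) * sum((v - mb) ** 2 for v in b)
    )

r_cont = pearson(xs, ys)
r_dich = pearson([1.0 if x > 0 else 0.0 for x in xs], ys)

print(f"continuous r     : {r_cont:.3f}")
print(f"dichotomised r   : {r_dich:.3f}")
print(f"attenuation      : {r_dich / r_cont:.3f}  (theory: {math.sqrt(2 / math.pi):.3f})")
print(f"effective n ratio: {(r_dich / r_cont) ** 2:.3f}  (theory: {2 / math.pi:.3f})")
```

The squared attenuation factor is the fraction of information retained, so roughly a third of the sample’s worth of statistical power disappears the moment we median-split, no matter how carefully we chose the threshold.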