What happens if you treat ordinal ratings as interval data? Human evaluations in NLP are even more under-powered than you think

Howcroft, David M.; Rieser, Verena

Abstract

Previous work has shown that human evaluations in NLP are notoriously under-powered. Here, we argue that two common factors make this problem even worse: NLP studies usually (a) treat ordinal data as interval data and (b) operate in high-variance settings while the differences they hope to detect are often subtle. We demonstrate through simulation that ordinal mixed effects models are better able to detect small differences between models, especially in the high-variance settings common in evaluations of generated texts. We release tools for researchers to conduct their own power analyses and test their assumptions. We also make recommendations for improving statistical power.
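
The authors' released tools are not reproduced here, but the core idea of a simulation-based power analysis is easy to sketch. The Python snippet below is an illustration, not the paper's code: the thresholds, effect size, and helper names (simulate_ratings, estimate_power) are hypothetical, and a Mann-Whitney U rank test stands in for the cumulative link mixed models the paper actually evaluates. It draws 5-point ordinal ratings for two systems from a latent-variable model and estimates each test's power as the fraction of simulated studies that detect the true difference.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def simulate_ratings(n, shift, thresholds):
    """Draw n ordinal ratings from a latent-normal threshold model:
    latent quality ~ Normal(shift, 1), cut into len(thresholds)+1
    ordered categories (a hypothetical 5-point Likert scale here)."""
    latent = rng.normal(loc=shift, scale=1.0, size=n)
    return np.digitize(latent, thresholds) + 1  # categories 1..K

def estimate_power(n_items, effect, n_sims=2000, alpha=0.05):
    """Estimate power as the fraction of simulated studies in which
    each test detects the true latent difference `effect`."""
    thresholds = [-1.5, -0.5, 0.5, 1.5]  # assumed cutpoints -> 5 categories
    hits_t = hits_u = 0
    for _ in range(n_sims):
        a = simulate_ratings(n_items, 0.0, thresholds)     # baseline system
        b = simulate_ratings(n_items, effect, thresholds)  # improved system
        # (a) interval assumption: t-test on the raw 1..5 codes
        if stats.ttest_ind(a, b).pvalue < alpha:
            hits_t += 1
        # (b) ordinal-appropriate rank test (a stand-in for the paper's
        # cumulative link mixed models, which also model rater/item variance)
        if stats.mannwhitneyu(a, b).pvalue < alpha:
            hits_u += 1
    return hits_t / n_sims, hits_u / n_sims

power_t, power_u = estimate_power(n_items=100, effect=0.2)
print(f"t-test power: {power_t:.2f}, Mann-Whitney power: {power_u:.2f}")
```

With a small latent effect and modest sample size, both tests miss the true difference in a large share of simulated studies, which is the under-powering the abstract describes; the paper's ordinal mixed effects models go further by also accounting for per-rater and per-item variance.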

Citation

Howcroft, D. M., & Rieser, V. (2021). What happens if you treat ordinal ratings as interval data? Human evaluations in NLP are even more under-powered than you think. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (pp. 8932–8939). Association for Computational Linguistics.

Conference Name: 2021 Conference on Empirical Methods in Natural Language Processing
Start Date: Nov 7, 2021
End Date: Nov 11, 2021
Acceptance Date: Aug 26, 2021
Publication Date: Nov 2021
Deposit Date: Dec 2, 2021
Publicly Available Date: Dec 2, 2021
Publisher: Association for Computational Linguistics (ACL)
Pages: 8932-8939
Book Title: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
Public URL: http://researchrepository.napier.ac.uk/Output/2826175
Publisher URL: https://aclanthology.org/2021.emnlp-main.703
