What happens if you treat ordinal ratings as interval data? Human evaluations in NLP are even more under-powered than you think

Howcroft, David M.; Rieser, Verena

Authors

Dr. Dave Howcroft D.Howcroft@napier.ac.uk
Associate

Verena Rieser

Abstract

Previous work has shown that human evaluations in NLP are notoriously under-powered. Here, we argue that there are two common factors which make this problem even worse: NLP studies usually (a) treat ordinal data as interval data and (b) operate under high variance settings while the differences they are hoping to detect are often subtle. We demonstrate through simulation that ordinal mixed effects models are better able to detect small differences between models, especially in high variance settings common in evaluations of generated texts. We release tools for researchers to conduct their own power analysis and test their assumptions. We also make recommendations for improving statistical power.
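The effect of variance on power described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' released tooling and does not fit an ordinal mixed effects model. It assumes a hypothetical setup in which each 1-5 rating comes from thresholding a latent normal score, and it analyses the ratings with an interval-style comparison of means (a Welch-type test using a normal approximation). The cut points, sample sizes, and function names are illustrative assumptions.

```python
import random
from statistics import NormalDist, mean, variance

# Hypothetical cut points mapping a latent continuous score to a 1-5 rating.
THRESHOLDS = (-1.5, -0.5, 0.5, 1.5)


def rate(latent):
    """Discretise a latent score into an ordinal rating from 1 to 5."""
    return 1 + sum(latent > t for t in THRESHOLDS)


def sample_ratings(rng, n, shift, sd):
    """Draw n ratings for a system whose latent quality is centred at `shift`."""
    return [rate(rng.gauss(shift, sd)) for _ in range(n)]


def welch_p_value(a, b):
    """Two-sided p-value for a difference in means (normal approximation).

    This treats the ordinal ratings as interval data.
    """
    se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    if se == 0.0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (mean(a) - mean(b)) / se
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def estimate_power(n, shift, sd, runs=2000, alpha=0.05, seed=0):
    """Fraction of simulated studies in which the test rejects at level alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(runs):
        baseline = sample_ratings(rng, n, 0.0, sd)
        improved = sample_ratings(rng, n, shift, sd)
        if welch_p_value(baseline, improved) < alpha:
            hits += 1
    return hits / runs


if __name__ == "__main__":
    for sd in (1.0, 2.0):
        print(f"latent sd={sd}: estimated power={estimate_power(50, 0.3, sd):.2f}")
```

Running the script prints the estimated power for the same latent difference at two noise levels. Under these assumptions, the higher-variance setting detects the difference much less often, which is the kind of check a simulation-based power analysis makes before collecting ratings.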

Citation

Howcroft, D. M., & Rieser, V. (2021, November). What happens if you treat ordinal ratings as interval data? Human evaluations in NLP are even more under-powered than you think. Presented at 2021 Conference on Empirical Methods in Natural Language Processing.

Presentation Conference Type	Conference Paper (published)
Conference Name	2021 Conference on Empirical Methods in Natural Language Processing
Start Date	Nov 7, 2021
End Date	Nov 11, 2021
Acceptance Date	Aug 26, 2021
Publication Date	2021-11
Deposit Date	Dec 2, 2021
Publicly Available Date	Dec 2, 2021
Publisher	Association for Computational Linguistics (ACL)
Pages	8932-8939
Book Title	Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
Public URL	http://researchrepository.napier.ac.uk/Output/2826175
Publisher URL	https://aclanthology.org/2021.emnlp-main.703

Files

What Happens If You Treat Ordinal Ratings As Interval Data? Human Evaluations In NLP Are Even More Under-powered Than You Think (425 KB)
PDF

Publisher Licence URL
http://creativecommons.org/licenses/by/4.0/
