Dr. Dave Howcroft D.Howcroft@napier.ac.uk
Research Fellow
What happens if you treat ordinal ratings as interval data? Human evaluations in NLP are even more under-powered than you think
Howcroft, David M.; Rieser, Verena
Abstract
Previous work has shown that human evaluations in NLP are notoriously under-powered. Here, we argue that there are two common factors which make this problem even worse: NLP studies usually (a) treat ordinal data as interval data and (b) operate under high variance settings while the differences they are hoping to detect are often subtle. We demonstrate through simulation that ordinal mixed effects models are better able to detect small differences between models, especially in high variance settings common in evaluations of generated texts. We release tools for researchers to conduct their own power analysis and test their assumptions. We also make recommendations for improving statistical power.
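As a rough illustration of the kind of simulation the abstract describes (this is a minimal sketch, not the authors' released tooling): two systems differ by a small shift on an assumed latent quality scale, ratings are observed on a 1–5 ordinal scale via fixed cutpoints, and power is estimated as the fraction of simulated experiments in which a t-test, treating the ordinal ratings as interval data, detects the difference. The thresholds, effect sizes, and variance below are illustrative assumptions; the paper's point is that ordinal mixed effects models outperform this interval-data baseline, which is not implemented here.

```python
# Illustrative power simulation for the interval-data baseline only.
# Cutpoints, effect size, and latent SD are hypothetical choices.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
THRESHOLDS = np.array([-1.5, -0.5, 0.5, 1.5])  # latent cutpoints -> 5 categories

def ordinal_ratings(shift, n, sd=1.5):
    """Sample n ratings on a 1-5 scale from a latent normal with mean `shift`."""
    latent = rng.normal(shift, sd, size=n)
    return np.digitize(latent, THRESHOLDS) + 1

def estimated_power(effect=0.3, n=100, sims=500, alpha=0.05):
    """Fraction of simulations where an independent-samples t-test
    (treating the ordinal ratings as interval data) rejects at alpha."""
    hits = 0
    for _ in range(sims):
        a = ordinal_ratings(0.0, n)
        b = ordinal_ratings(effect, n)
        if stats.ttest_ind(a, b).pvalue < alpha:
            hits += 1
    return hits / sims

power = estimated_power()
print(f"estimated power at a small latent effect: {power:.2f}")
```

Under a high latent variance (sd=1.5) and a subtle effect (0.3), the detection rate stays well below the conventional 0.8 target, which is the under-powering the abstract refers to.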
Citation
Howcroft, D. M., & Rieser, V. (2021). What happens if you treat ordinal ratings as interval data? Human evaluations in NLP are even more under-powered than you think. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (pp. 8932–8939). Association for Computational Linguistics.
| Conference Name | 2021 Conference on Empirical Methods in Natural Language Processing |
| --- | --- |
| Start Date | Nov 7, 2021 |
| End Date | Nov 11, 2021 |
| Acceptance Date | Aug 26, 2021 |
| Publication Date | 2021-11 |
| Deposit Date | Dec 2, 2021 |
| Publicly Available Date | Dec 2, 2021 |
| Publisher | Association for Computational Linguistics (ACL) |
| Pages | 8932–8939 |
| Book Title | Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing |
| Public URL | http://researchrepository.napier.ac.uk/Output/2826175 |
| Publisher URL | https://aclanthology.org/2021.emnlp-main.703 |
Files
What Happens If You Treat Ordinal Ratings As Interval Data? Human Evaluations In NLP Are Even More Under-powered Than You Think
(425 Kb)
PDF
Publisher Licence URL
http://creativecommons.org/licenses/by/4.0/
You might also like
Building a dual dataset of text- and image-grounded conversations and summarisation in Gàidhlig (Scottish Gaelic)
(2023)
Conference Proceeding
enunlg: a Python library for reproducible neural data-to-text experimentation
(2023)
Conference Proceeding
LOWRECORP: the Low-Resource NLG Corpus Building Challenge
(2023)
Conference Proceeding
Most NLG is Low-Resource: here's what we can do about it
(2022)
Conference Proceeding
OTTers: One-turn Topic Transitions for Open-Domain Dialogue
(2021)
Conference Proceeding
About Edinburgh Napier Research Repository
Administrator e-mail: repository@napier.ac.uk
Powered by Worktribe © 2024