
Twenty Years of Confusion in Human Evaluation: NLG Needs Evaluation Sheets and Standardised Definitions

Howcroft, David; Belz, Anya; Clinciu, Miruna; Gkatzia, Dimitra; Hasan, Sadid A.; Mahamood, Saad; Mille, Simon; van Miltenburg, Emiel; Santhanam, Sashank; Rieser, Verena


Abstract

Human assessment remains the most trusted form of evaluation in NLG, but highly diverse approaches and a proliferation of different quality criteria used by researchers make it difficult to compare results and draw conclusions across papers, with adverse implications for meta-evaluation and reproducibility. In this paper, we present (i) our dataset of 165 NLG papers with human evaluations, (ii) the annotation scheme we developed to label the papers for different aspects of evaluations, (iii) quantitative analyses of the annotations, and (iv) a set of recommendations for improving standards in evaluation reporting. We use the annotations as a basis for examining information included in evaluation reports, and levels of consistency in approaches, experimental design and terminology, focusing in particular on the 200+ different terms that have been used for evaluated aspects of quality. We conclude that due to a pervasive lack of clarity in reports and extreme diversity in approaches, human evaluation in NLG presents as extremely confused in 2020, and that the field is in urgent need of standard methods and terminology.

Presentation Conference Type Conference Paper (Published)
Conference Name International Conference on Natural Language Generation (INLG 2020)
Start Date Dec 15, 2020
End Date Dec 18, 2020
Acceptance Date Oct 12, 2020
Publication Date Dec 2020
Deposit Date Nov 2, 2020
Publicly Available Date Nov 2, 2020
Publisher Association for Computational Linguistics (ACL)
Pages 169-182
Book Title Proceedings of the 13th International Conference on Natural Language Generation
Public URL http://researchrepository.napier.ac.uk/Output/2697597
Publisher URL https://www.aclweb.org/anthology/2020.inlg-1.23/

Files

Twenty Years Of Confusion In Human Evaluation: NLG Needs Evaluation Sheets And Standardised Definitions (accepted version) (944 Kb)
PDF
