Building a dual dataset of text- and image-grounded conversations and summarisation in Gàidhlig (Scottish Gaelic)

Howcroft, David M; Lamb, Will; Groundwater, Anna; Gkatzia, Dimitra

Abstract

Gàidhlig (Scottish Gaelic; gd) is spoken by about 57k people in Scotland, but remains an under-resourced language with respect to natural language processing in general and natural language generation (NLG) in particular. To address this gap, we developed the first datasets for Scottish Gaelic NLG, collecting both conversational and summarisation data in a single setting. Our task setup involves dialogues between a pair of speakers discussing museum exhibits, grounding the conversation in images and texts. Both interlocutors then summarise the dialogue, resulting in a secondary dialogue summarisation dataset. This paper presents the dialogue and summarisation corpora, as well as the software used for data collection. The corpus consists of 43 conversations (13.7k words) and 61 summaries (2.0k words), and will be released along with the data collection interface.

Citation

Howcroft, D. M., Lamb, W., Groundwater, A., & Gkatzia, D. (2023, September). Building a dual dataset of text- and image-grounded conversations and summarisation in Gàidhlig (Scottish Gaelic). Presented at The 16th International Natural Language Generation Conference.

Presentation Conference Type: Conference Paper (Published)
Conference Name: The 16th International Natural Language Generation Conference
Start Date: Sep 11, 2023
End Date: Sep 15, 2023
Acceptance Date: Jul 12, 2023
Online Publication Date: Sep 11, 2023
Publication Date: 2023-09
Deposit Date: Nov 15, 2023
Publisher: Association for Computational Linguistics (ACL)
Pages: 443-448
Book Title: Proceedings of the 16th International Natural Language Generation Conference
ISBN: 9798891760011
Public URL: http://researchrepository.napier.ac.uk/Output/3385879
Publisher URL: https://aclanthology.org/2023.inlg-main.34/