
Optimising strategies for learning visually grounded word meanings through interaction

Authors

Yu, Yanchao

Abstract

Language Grounding is a fundamental problem in AI, concerning how symbols in Natural Language (e.g. words and phrases) refer to aspects of the physical environment (e.g. objects and attributes). In this thesis, our ultimate goal is to address an interactive language grounding problem, i.e. learning perceptual groundings (specifically vision) through Natural Language (NL) interaction with humans. Although previous work has shown significant progress on language/symbol grounding across different tasks, there are still limitations and unsolved problems: (a) learning groundings only holistically, without understanding individual parts of the linguistic and non-linguistic context; (b) requiring training data of high quantity and quality, without the possibility of on-line error correction; and (c) not being able to continuously and incrementally learn from the external environment. Most of these limitations are likely to be alleviated if systems can learn symbol groundings, as and when needed, from natural, everyday conversations with humans.

To address all of the above limitations at once, this thesis proposes a modular Interactive Multi-modal Framework, which is compositional, optimised, trainable incrementally with small amounts of data, and able to handle natural, spontaneous dialogue. Specifically, we collect real human-human conversations (the BURCHAK corpus) to investigate how humans behave in an interactive learning task; the corpus contains a wide range of dialogue capabilities, strategies, and linguistic phenomena encountered in natural, spontaneous dialogue. The thesis then explores how different capabilities and strategies (drawn from the real data) affect the overall learning/grounding efficiency, i.e. achieving higher recognition accuracy with less human effort in the dialogue. We found that an agent performs better if it is able to: 1) take the initiative, 2) consider both uncertainty from visual classification and context-dependencies from dialogue, and 3) request further information when necessary.

Finally, following the above results, we train an optimised multi-modal dialogue agent using Reinforcement Learning against the real data to address interactive language grounding. The agent learns: (1) to perform a form of active learning, i.e. only ask for further information when necessary, and (2) to process natural, everyday conversations with humans. Here, we integrate our framework with an incremental semantic formalism (the DS-TTR framework) that dynamically produces compositional representations of both linguistic and non-linguistic (visual) context, and is able to process natural, spontaneous conversations (specifically incremental phenomena, such as "self-repair"). These advances bring us closer to addressing the interactive grounding problem, and to bringing robots from the laboratory into the real world, where they will need to speak the same language as human beings.

Citation

Yu, Y. (2018). Optimising strategies for learning visually grounded word meanings through interaction. (Thesis)

Thesis Type Thesis
Deposit Date Jun 28, 2023
Award Date Nov 14, 2018

