
Multimodal Word Discovery and Retrieval With Spoken Descriptions and Visual Concepts


Abstract:

In the absence of dictionaries, translators, or grammars, it is still possible to learn some of the words of a new language by listening to spoken descriptions of images. If several images, each containing a particular visually salient object, each co-occur with a particular sequence of speech sounds, we can infer that those speech sounds are a word whose definition is the visible object. A multimodal word discovery system accepts, as input, a database of spoken descriptions of images (or a set of corresponding phone transcriptions) and learns a mapping from waveform segments (or phone strings) to their associated image concepts. In this article, four multimodal word discovery systems are demonstrated: three models based on statistical machine translation (SMT) and one based on neural machine translation (NMT). The systems are trained with phonetic transcriptions, MFCCs, and multilingual bottleneck (MBN) features. At the phone level, the SMT models outperform the NMT model, achieving a 61.6% F1 score on the phone-level word discovery task on Flickr30k. At the audio level, we compare our models with the existing ES-KMeans algorithm for word discovery and present some of the challenges in multimodal spoken word discovery.
Page(s): 1560-1573
Date of Publication: 20 May 2020
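
As a concrete illustration of the SMT-style approach the abstract describes, the sketch below trains an IBM Model 1 aligner by EM to associate phones in a caption's transcription with the visual concepts of the paired image, then segments a new phone string by grouping contiguous phones aligned to the same concept. The toy captions, the NULL-concept convention, and the run-based segmentation heuristic are assumptions made for illustration, not the paper's exact models or data.

from collections import defaultdict

# Toy parallel data: (phone transcription of a spoken caption, visual
# concepts detected in the paired image). Illustrative data only.
pairs = [
    ("k ae t s ae t".split(), ["CAT"]),
    ("dh ax k ae t r ae n".split(), ["CAT"]),
    ("ax d ao g r ae n".split(), ["DOG"]),
    ("d ao g z jh ah m p".split(), ["DOG"]),
]

NULL = "<NULL>"  # dummy concept absorbing phones that describe no object


def train_ibm1(pairs, iterations=25):
    # EM for IBM Model 1: estimate t(phone | concept) from co-occurrence.
    t = defaultdict(lambda: 0.01)  # near-uniform initialization
    for _ in range(iterations):
        count = defaultdict(float)  # expected phone/concept counts
        total = defaultdict(float)  # expected concept marginals
        for phones, concepts in pairs:
            cands = [NULL] + concepts
            for p in phones:
                z = sum(t[(p, c)] for c in cands)  # normalizer
                for c in cands:
                    count[(p, c)] += t[(p, c)] / z  # E-step: soft alignment
                    total[c] += t[(p, c)] / z
        for (p, c), n in count.items():
            t[(p, c)] = n / total[c]  # M-step: renormalize per concept
    return t


def discover(phones, concepts, t):
    # Align each phone to its most probable concept; a maximal run of
    # phones aligned to the same non-NULL concept is a hypothesized word.
    cands = [NULL] + concepts  # NULL first so ties fall to "no concept"
    best = [max(cands, key=lambda c: t[(p, c)]) for p in phones]
    segments, start = [], 0
    for i in range(1, len(phones) + 1):
        if i == len(phones) or best[i] != best[start]:
            if best[start] != NULL:
                segments.append((" ".join(phones[start:i]), best[start]))
            start = i
    return segments


t = train_ibm1(pairs)
# Prints contiguous phone spans hypothesized as words for CAT, e.g. a
# span covering "k ae t"; the exact output depends on the EM settings.
print(discover("k ae t s l iy p s".split(), ["CAT"], t))

The NULL concept plays the same role here as the NULL word in classical SMT alignment: it soaks up function phones and background speech so that only phones genuinely predictive of a visual concept are segmented out as candidate words.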

