In this paper, we present a method for the discovery of word-like units and their approximate translations from visually grounded speech across multiple languages. We first train a neural network model to map images and their spoken audio captions in both English and Hindi to a shared, multimodal embedding space. Next, we use this model to segment and cluster regions of the spoken captions which approximately correspond to words. Finally, we exploit between-cluster similarities in the embedding space to associate English pseudo-word clusters with Hindi pseudo-word clusters, and show that many of these cluster pairings capture semantic translations between English and Hindi words. We present quantitative cross-lingual clustering results, as well as qualitative results in the form of a bilingual picture dictionary.
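The final step, pairing pseudo-word clusters across languages via embedding similarity, can be sketched as below. The function name, the use of cluster centroids, and the choice of cosine similarity are illustrative assumptions; the paper's exact association procedure may differ.

```python
import numpy as np

def pair_clusters(en_centroids, hi_centroids):
    """Pair English and Hindi pseudo-word clusters by cosine similarity
    of their centroids in a shared embedding space (illustrative sketch).

    Returns a list of (en_cluster, best_hi_cluster, similarity) tuples.
    """
    # L2-normalize centroids so the dot product equals cosine similarity.
    en = en_centroids / np.linalg.norm(en_centroids, axis=1, keepdims=True)
    hi = hi_centroids / np.linalg.norm(hi_centroids, axis=1, keepdims=True)
    sim = en @ hi.T  # (num_en_clusters, num_hi_clusters) similarity matrix
    return [(i, int(np.argmax(sim[i])), float(sim[i].max()))
            for i in range(len(en))]
```

Each English cluster is matched to its nearest Hindi cluster; high-similarity pairs are the candidate translations shown in the bilingual picture dictionary.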
Cite as: Azuh, E., Harwath, D., Glass, J. (2019) Towards Bilingual Lexicon Discovery From Visually Grounded Speech Audio. Proc. Interspeech 2019, 276-280, doi: 10.21437/Interspeech.2019-1718
@inproceedings{azuh19_interspeech,
  author={Emmanuel Azuh and David Harwath and James Glass},
  title={{Towards Bilingual Lexicon Discovery From Visually Grounded Speech Audio}},
  year={2019},
  booktitle={Proc. Interspeech 2019},
  pages={276--280},
  doi={10.21437/Interspeech.2019-1718}
}