Abstract
Recent unsupervised multi-modal machine translation methods have shown promising performance in capturing semantic relationships from unannotated monolingual corpora through large-scale pretraining. Empirical studies show that a small, accessible parallel corpus can yield performance gains comparable to those of large pretraining corpora in the unsupervised setting. Inspired by this observation, we argue that semi-supervised learning can largely reduce the demand for pretraining corpora without performance degradation in low-cost scenarios. However, the images in parallel corpora typically contain much irrelevant information, i.e., visual noise. Such noise has a negative impact on the semantic alignment between the source and target languages in semi-supervised learning, thus weakening the contribution of the parallel corpora. To effectively utilize valuable and expensive parallel corpora, we propose a Noise-robust Semi-supervised Multi-modal Machine Translation method (Semi-MMT). In particular, a visual cross-attention sublayer is introduced into the source- and target-language decoders, respectively, and the text representations are used as a guide to filter out visual noise. Building on the visual cross-attention, we further devise a hybrid training strategy that employs four unsupervised and two supervised tasks to reduce the mismatch between the semantic representation spaces of the source and target languages. Extensive experiments on the Multi30k dataset show that our method outperforms state-of-the-art unsupervised methods that use large-scale extra corpora for pretraining in terms of the METEOR metric, while requiring only 7% of the parallel corpora.
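The paper defines the visual cross-attention sublayer itself; purely as an illustration of the general idea (text representations querying image-region features so that attention weights suppress irrelevant regions), a minimal single-head sketch might look like the following. All names, shapes, and projection matrices here are assumptions for illustration, not the authors' implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def visual_cross_attention(text_h, img_feats, W_q, W_k, W_v):
    """Text-guided attention over image regions (single head, no batching).

    text_h:    (T, d_text)  decoder hidden states for T tokens
    img_feats: (R, d_img)   features for R image regions
    W_q, W_k, W_v: learned projections into a shared dimension d
    Returns a (T, d) visual context per token and the (T, R) weights.
    """
    Q = text_h @ W_q                          # (T, d) text queries
    K = img_feats @ W_k                       # (R, d) region keys
    V = img_feats @ W_v                       # (R, d) region values
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # (T, R) scaled dot-product
    attn = softmax(scores, axis=-1)           # low weight on noisy regions
    return attn @ V, attn

# Toy shapes: 5 tokens, 49 regions (e.g., a 7x7 CNN feature map).
rng = np.random.default_rng(0)
ctx, attn = visual_cross_attention(
    rng.normal(size=(5, 16)), rng.normal(size=(49, 32)),
    rng.normal(size=(16, 8)), rng.normal(size=(32, 8)),
    rng.normal(size=(32, 8)))
```

Because the queries come from the text side, each token attends most to the regions that match its semantics; regions unrelated to any token receive small weights everywhere, which is one plausible reading of "using text representations as a guideline to filter visual noise."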
Acknowledgements
This work is supported in part by the National Natural Science Foundation of China (62276196), the Key Research and Development Program of Hubei Province (No. 2021BAA030) and the China Scholarship Council (LiuJinMei [2020] 1509, 202106950041).
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Li, L., Hu, K., Tayir, T., Liu, J., Lee, K.A. (2022). Noise-Robust Semi-supervised Multi-modal Machine Translation. In: Khanna, S., Cao, J., Bai, Q., Xu, G. (eds) PRICAI 2022: Trends in Artificial Intelligence. PRICAI 2022. Lecture Notes in Computer Science, vol 13630. Springer, Cham. https://doi.org/10.1007/978-3-031-20865-2_12
Print ISBN: 978-3-031-20864-5
Online ISBN: 978-3-031-20865-2