Noise-Robust Semi-supervised Multi-modal Machine Translation

  • Conference paper
PRICAI 2022: Trends in Artificial Intelligence (PRICAI 2022)

Abstract

Recent unsupervised multi-modal machine translation methods have shown promising performance in capturing semantic relationships from unannotated monolingual corpora through large-scale pretraining. Empirical studies show that a small, accessible parallel corpus can achieve performance gains comparable to those of large pretraining corpora in the unsupervised setting. Inspired by this observation, we argue that semi-supervised learning can largely reduce the demand for pretraining corpora without performance degradation in low-cost scenarios. However, the images in parallel corpora typically contain much irrelevant information, i.e., visual noise. Such noise negatively affects the semantic alignment between source and target languages in semi-supervised learning, thus weakening the contribution of the parallel corpora. To effectively utilize the valuable and expensive parallel corpora, we propose a Noise-robust Semi-supervised Multi-modal Machine Translation method (Semi-MMT). In particular, a visual cross-attention sublayer is introduced into the source and target language decoders, respectively, and the textual representations are used as a guide to filter out visual noise. Based on the visual cross-attention, we further devise a hybrid training strategy that employs four unsupervised and two supervised tasks to reduce the mismatch between the semantic representation spaces of the source and target languages. Extensive experiments conducted on the Multi30K dataset show that our method outperforms state-of-the-art unsupervised methods that use large-scale extra corpora for pretraining in terms of the METEOR metric, while requiring only 7% of the parallel corpora.
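
The abstract describes the visual cross-attention sublayer only at a high level. As a rough illustration of the general idea, the following PyTorch sketch shows one plausible form of such a sublayer, in which decoder text states act as queries over projected image region features, so that regions unrelated to the text receive low attention weight. All module names, dimensions, and design choices here (the class name VisualCrossAttention, 2048-dimensional region features, the residual-plus-norm layout) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class VisualCrossAttention(nn.Module):
    """Hypothetical sketch of a text-guided visual cross-attention sublayer.

    Decoder text states serve as queries over image region features, so
    regions that are irrelevant to the text (visual noise) receive low
    attention weight.
    """

    def __init__(self, d_model: int = 512, n_heads: int = 8, d_visual: int = 2048):
        super().__init__()
        # Project image region features (e.g., CNN grid vectors) to the model dim.
        self.visual_proj = nn.Linear(d_visual, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, text_states, visual_feats):
        # text_states: (batch, text_len, d_model), queries from the decoder.
        # visual_feats: (batch, n_regions, d_visual), keys/values from the image.
        v = self.visual_proj(visual_feats)
        attended, _ = self.attn(query=text_states, key=v, value=v)
        # Standard Transformer sublayer pattern: residual connection + layer norm.
        return self.norm(text_states + attended)


# Toy usage: 2 sentences of length 10, 49 image regions (a 7x7 feature grid).
layer = VisualCrossAttention()
out = layer(torch.randn(2, 10, 512), torch.randn(2, 49, 2048))
print(out.shape)  # torch.Size([2, 10, 512])
```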

References

  1. Calixto, I., Chowdhury, K.D., Liu, Q.: DCU system report on the WMT 2017 multi-modal machine translation task. In: Proceedings of the Second Conference on Machine Translation (WMT), pp. 440–444 (2017)

  2. Calixto, I., Liu, Q.: Incorporating global visual features into attention-based neural machine translation. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 992–1003 (2017)

  3. Calixto, I., Rios, M., Aziz, W.: Latent variable model for multi-modal translation. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 6392–6405 (2019)

  4. Chen, S., Jin, Q., Fu, J.: From words to sentences: a progressive learning approach for zero-resource machine translation with visual pivots. In: Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI), pp. 4932–4938 (2019)

  5. Cheng, Y., et al.: Semi-supervised learning for neural machine translation. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL) (2016)

  6. Deng, J., Dong, W., Socher, R., Li, L., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 248–255 (2009)

  7. Elliott, D., Frank, S., Sima’an, K., Specia, L.: Multi30K: multilingual English-German image descriptions. In: Proceedings of the 5th Workshop on Vision and Language, hosted by the 54th Annual Meeting of the Association for Computational Linguistics (VL@ACL), pp. 627–633 (2016)

  8. Gehring, J., Auli, M., Grangier, D., Yarats, D., Dauphin, Y.N.: Convolutional sequence to sequence learning. In: Proceedings of the 34th International Conference on Machine Learning (ICML), pp. 1243–1252 (2017)

  9. Grönroos, S., et al.: The MeMAD submission to the WMT18 multimodal translation task. In: Proceedings of the Third Conference on Machine Translation: Shared Task Papers (WMT), pp. 603–611 (2018)

  10. Han, Y., Li, L., Zhang, J.: A coordinated representation learning enhanced multimodal machine translation approach with multi-attention. In: Proceedings of the 2020 International Conference on Multimedia Retrieval (ICMR), pp. 571–577 (2020)

  11. Helcl, J., Libovický, J., Varis, D.: CUNI system for the WMT18 multimodal translation task. In: Proceedings of the Third Conference on Machine Translation (WMT), pp. 616–623 (2018)

  12. Huang, P., Sun, S., Yang, H.: Image-assisted transformer in zero-resource multi-modal translation. In: Proceedings of the 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7548–7552 (2021)

  13. Ive, J., Madhyastha, P., Specia, L.: Distilling translations with visual awareness. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 6525–6538 (2019)

  14. Karita, S., Watanabe, S., Iwata, T., Delcroix, M., Ogawa, A., Nakatani, T.: Semi-supervised end-to-end speech recognition using text-to-speech and autoencoders. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6166–6170 (2019)

  15. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: Proceedings of the 3rd International Conference on Learning Representations (ICLR) (2015)

  16. Lample, G., Conneau, A., Denoyer, L., Ranzato, M.: Unsupervised machine translation using monolingual corpora only. In: Proceedings of the 6th International Conference on Learning Representations (ICLR) (2018)

  17. Lavie, A., Agarwal, A.: METEOR: an automatic metric for MT evaluation with high levels of correlation with human judgments. In: Proceedings of the Second Workshop on Statistical Machine Translation (WMT@ACL), pp. 228–231 (2007)

  18. Li, L., Hu, K., Zheng, Y., Liu, J., Lee, K.A.: CoopNet: multi-modal cooperative gender prediction in social media user profiling. In: Proceedings of the 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4310–4314 (2021)

  19. Li, L., Tayir, T., Hu, K., Zhou, D.: Multi-modal and multi-perspective machine translation by collecting diverse alignments. In: Pham, D.N., Theeramunkong, T., Governatori, G., Liu, F. (eds.) PRICAI 2021. LNCS (LNAI), vol. 13032, pp. 311–322. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-89363-7_24

  20. Papineni, K., Roukos, S., Ward, T., Zhu, W.: BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 311–318 (2002)

  21. Su, Y., Fan, K., Bach, N., Kuo, C.J., Huang, F.: Unsupervised multi-modal neural machine translation. In: Proceedings of the 2019 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10482–10491 (2019)

  22. Vaswani, A., et al.: Attention is all you need. In: Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS), pp. 5998–6008 (2017)

  23. Vincent, P., Larochelle, H., Bengio, Y., Manzagol, P.: Extracting and composing robust features with denoising autoencoders. In: Proceedings of the 25th International Conference on Machine Learning (ICML), pp. 1096–1103 (2008)

  24. Wang, Y., et al.: Semi-supervised neural machine translation via marginal distribution estimation. IEEE/ACM Trans. Audio Speech Lang. Process. 27(10), 1564–1576 (2019)

  25. Xu, W., Niu, X., Carpuat, M.: Dual reconstruction: a unifying objective for semi-supervised neural machine translation. In: Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 2006–2020 (2020)

  26. Yao, S., Wan, X.: Multimodal transformer for multimodal machine translation. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 4346–4350 (2020)

Acknowledgements

This work is supported in part by the National Natural Science Foundation of China (62276196), the Key Research and Development Program of Hubei Province (No. 2021BAA030) and the China Scholarship Council (LiuJinMei [2020] 1509, 202106950041).

Author information

Corresponding author

Correspondence to Lin Li.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Li, L., Hu, K., Tayir, T., Liu, J., Lee, K.A. (2022). Noise-Robust Semi-supervised Multi-modal Machine Translation. In: Khanna, S., Cao, J., Bai, Q., Xu, G. (eds) PRICAI 2022: Trends in Artificial Intelligence. PRICAI 2022. Lecture Notes in Computer Science, vol 13630. Springer, Cham. https://doi.org/10.1007/978-3-031-20865-2_12

  • DOI: https://doi.org/10.1007/978-3-031-20865-2_12

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-20864-5

  • Online ISBN: 978-3-031-20865-2

  • eBook Packages: Computer Science, Computer Science (R0)
