Multi-teacher Knowledge Distillation for End-to-End Text Image Machine Translation

  • Conference paper
Document Analysis and Recognition - ICDAR 2023 (ICDAR 2023)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14187)

Abstract

Text image machine translation (TIMT), which translates source-language text in images into target-language sentences, has been widely used in real-world applications. Existing TIMT methods fall into two main categories: the recognition-then-translation pipeline model and the end-to-end model. However, how to transfer knowledge from the pipeline model into the end-to-end model remains an unsolved problem. In this paper, we propose a novel Multi-Teacher Knowledge Distillation (MTKD) method to effectively distill knowledge from the pipeline model into the end-to-end TIMT model. Specifically, three teachers are used to improve the performance of the end-to-end TIMT model. The image encoder in the end-to-end TIMT model is optimized with knowledge distillation guidance from the recognition teacher encoder, while the sequential encoder and decoder are improved by transferring knowledge from the translation teacher's sequential encoder and decoder, respectively. Furthermore, both token-level and sentence-level knowledge distillation are incorporated to further boost translation performance. Extensive experimental results show that the proposed MTKD effectively improves text image translation performance and outperforms existing end-to-end and pipeline models with fewer parameters and less decoding time, illustrating that MTKD can take advantage of both pipeline and end-to-end models. Our code is available at: https://github.com/EriCongMa/MTKD_TIMT.
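The abstract does not give the loss functions, so the following is only a minimal sketch of how a three-teacher objective of this kind could be assembled. It is not the authors' implementation. It assumes a mean-squared-error term for matching encoder features and a KL-divergence term for token-level distillation of decoder outputs; the function names (`feature_mse`, `token_kd_loss`, `mtkd_loss`) and the weights `w_img`, `w_seq`, `w_dec` are hypothetical. Sentence-level distillation is assumed to mean training on teacher-generated translations and is not shown.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over one list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def token_kd_loss(student_logits, teacher_logits, temperature=1.0):
    """Token-level KD (assumed form): mean KL(teacher || student) over positions.

    Both arguments are lists of per-position logit lists of equal shape.
    """
    total = 0.0
    for s_row, t_row in zip(student_logits, teacher_logits):
        p = softmax(t_row, temperature)  # teacher distribution
        q = softmax(s_row, temperature)  # student distribution
        total += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return total / len(student_logits)


def feature_mse(student_feats, teacher_feats):
    """Feature-level KD (assumed form): mean squared error between hidden states."""
    total, count = 0.0, 0
    for s_vec, t_vec in zip(student_feats, teacher_feats):
        for s, t in zip(s_vec, t_vec):
            total += (s - t) ** 2
            count += 1
    return total / count


def mtkd_loss(translation_loss, image_enc_kd, seq_enc_kd, decoder_kd,
              w_img=1.0, w_seq=1.0, w_dec=1.0):
    """Weighted sum of the translation loss and three teacher terms.

    image_enc_kd: student image encoder vs. recognition teacher encoder
    seq_enc_kd:   student sequential encoder vs. translation teacher encoder
    decoder_kd:   student decoder vs. translation teacher decoder
    The weights are hypothetical hyperparameters.
    """
    return (translation_loss + w_img * image_enc_kd
            + w_seq * seq_enc_kd + w_dec * decoder_kd)


if __name__ == "__main__":
    student = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]]
    teacher = [[1.5, 0.7, -0.5], [0.0, 0.4, 0.2]]
    dec_kd = token_kd_loss(student, teacher)
    img_kd = feature_mse([[0.1, 0.2]], [[0.0, 0.3]])
    seq_kd = feature_mse([[1.0, 2.0]], [[1.0, 4.0]])
    print(mtkd_loss(1.0, img_kd, seq_kd, dec_kd))
```

Keeping the three teacher terms separate makes it straightforward to ablate each teacher by setting its weight to zero.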

Notes

  1. https://github.com/mjpost/sacrebleu

Acknowledgement

This work has been supported by the National Natural Science Foundation of China (NSFC) under grant 62106265.

Author information

Corresponding author

Correspondence to Yaping Zhang.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Ma, C., Zhang, Y., Tu, M., Zhao, Y., Zhou, Y., Zong, C. (2023). Multi-teacher Knowledge Distillation for End-to-End Text Image Machine Translation. In: Fink, G.A., Jain, R., Kise, K., Zanibbi, R. (eds) Document Analysis and Recognition - ICDAR 2023. ICDAR 2023. Lecture Notes in Computer Science, vol 14187. Springer, Cham. https://doi.org/10.1007/978-3-031-41676-7_28

  • DOI: https://doi.org/10.1007/978-3-031-41676-7_28

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-41675-0

  • Online ISBN: 978-3-031-41676-7

  • eBook Packages: Computer Science (R0)
