Abstract
Text image machine translation (TIMT), which translates source-language text in images into a target-language sentence, has been widely used in real-world applications. Existing TIMT methods fall into two categories: recognition-then-translation pipeline models and end-to-end models. However, how to transfer knowledge from the pipeline model into the end-to-end model remains an open problem. In this paper, we propose a novel Multi-Teacher Knowledge Distillation (MTKD) method that effectively distills knowledge from the pipeline model into the end-to-end TIMT model. Specifically, three teachers are utilized to improve the end-to-end TIMT model: the image encoder is optimized under knowledge-distillation guidance from the recognition teacher encoder, while the sequential encoder and decoder are improved by transferring knowledge from the translation teacher's sequential encoder and decoder. Furthermore, both token-level and sentence-level knowledge distillation are incorporated to further boost translation performance. Extensive experimental results show that MTKD effectively improves text image translation performance and outperforms existing end-to-end and pipeline models with fewer parameters and less decoding time, illustrating that MTKD can combine the advantages of both pipeline and end-to-end models. Our code is available at: https://github.com/EriCongMa/MTKD_TIMT.
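The token-level and sentence-level distillation objectives mentioned in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the exact loss formulation, temperature, and weighting are not given in the abstract, so the function names and details below are illustrative assumptions (token-level KD as a per-position KL divergence against the teacher's distribution, sentence-level KD as cross-entropy against the teacher's greedy output, following common practice in sequence-level distillation).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the vocabulary axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def token_level_kd(student_logits, teacher_logits):
    """Token-level KD: mean KL(teacher || student) over positions.

    Both inputs have shape (T, V): T target positions, V vocabulary size.
    """
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1)))

def sentence_level_kd(student_logits, teacher_logits):
    """Sentence-level KD: cross-entropy of the student against the
    teacher's greedy (argmax) output sequence, used as pseudo-targets."""
    pseudo_targets = teacher_logits.argmax(axis=-1)       # (T,)
    log_q = np.log(softmax(student_logits))               # (T, V)
    idx = np.arange(len(pseudo_targets))
    return float(-np.mean(log_q[idx, pseudo_targets]))
```

In practice these terms would be combined with the standard translation loss, with weights tuned per task; the sketch only shows the shape of each distillation signal.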
Acknowledgement
This work has been supported by the National Natural Science Foundation of China (NSFC) under grant 62106265.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Ma, C., Zhang, Y., Tu, M., Zhao, Y., Zhou, Y., Zong, C. (2023). Multi-teacher Knowledge Distillation for End-to-End Text Image Machine Translation. In: Fink, G.A., Jain, R., Kise, K., Zanibbi, R. (eds) Document Analysis and Recognition - ICDAR 2023. ICDAR 2023. Lecture Notes in Computer Science, vol 14187. Springer, Cham. https://doi.org/10.1007/978-3-031-41676-7_28
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-41675-0
Online ISBN: 978-3-031-41676-7