Abstract
Text image machine translation (TIMT), which translates source-language text in images into a target-language sentence, has been widely used in real-world applications. Existing TIMT methods fall into two categories: recognition-then-translation pipeline models and end-to-end models. However, how to transfer knowledge from the pipeline model into the end-to-end model remains an open problem. In this paper, we propose a novel Multi-Teacher Knowledge Distillation (MTKD) method that effectively distills knowledge from the pipeline model into the end-to-end TIMT model. Specifically, three teachers are utilized to improve the end-to-end TIMT model: the image encoder is optimized under knowledge-distillation guidance from the recognition teacher encoder, while the sequential encoder and decoder are improved by transferring knowledge from the translation teacher's sequential encoder and decoder. Furthermore, both token-level and sentence-level knowledge distillation are incorporated to further boost translation performance. Extensive experimental results show that MTKD effectively improves text image translation performance and outperforms existing end-to-end and pipeline models with fewer parameters and less decoding time, illustrating that MTKD can combine the advantages of both pipeline and end-to-end models. Our code is available at: https://github.com/EriCongMa/MTKD_TIMT.
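The token-level and sentence-level distillation objectives mentioned in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the exact loss formulation, temperature, and weighting are not given in the abstract, so the function names and details below are illustrative assumptions (token-level KD as a per-position KL divergence against the teacher's distribution, sentence-level KD as cross-entropy against the teacher's greedy output, following common practice in sequence-level distillation).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the vocabulary axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def token_level_kd(student_logits, teacher_logits):
    """Token-level KD: mean KL(teacher || student) over positions.

    Both inputs have shape (T, V): T target positions, V vocabulary size.
    """
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1)))

def sentence_level_kd(student_logits, teacher_logits):
    """Sentence-level KD: cross-entropy of the student against the
    teacher's greedy (argmax) output sequence, used as pseudo-targets."""
    pseudo_targets = teacher_logits.argmax(axis=-1)       # (T,)
    log_q = np.log(softmax(student_logits))               # (T, V)
    idx = np.arange(len(pseudo_targets))
    return float(-np.mean(log_q[idx, pseudo_targets]))
```

In practice these terms would be combined with the standard translation loss, with weights tuned per task; the sketch only shows the shape of each distillation signal.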
Acknowledgement
This work has been supported by the National Natural Science Foundation of China (NSFC) under grant 62106265.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Ma, C., Zhang, Y., Tu, M., Zhao, Y., Zhou, Y., Zong, C. (2023). Multi-teacher Knowledge Distillation for End-to-End Text Image Machine Translation. In: Fink, G.A., Jain, R., Kise, K., Zanibbi, R. (eds) Document Analysis and Recognition - ICDAR 2023. ICDAR 2023. Lecture Notes in Computer Science, vol 14187. Springer, Cham. https://doi.org/10.1007/978-3-031-41676-7_28
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-41675-0
Online ISBN: 978-3-031-41676-7