
Lifelong Scene Text Recognizer via Expert Modules

Published: 27 October 2023

Abstract

Scene text recognition (STR) has been actively studied in recent years, with a wide range of applications in autonomous driving, image retrieval, and more. However, when a pre-trained deep STR model learns a new task, its performance on previous tasks may drop dramatically due to catastrophic forgetting in deep neural networks. A potential remedy is incremental learning (IL), which has shown its effectiveness and made significant progress in image classification. Yet IL has rarely been explored in the context of STR, probably because the forgetting problem is even more severe there. To address this issue, we propose the lifelong scene text recognizer (LSTR), which learns STR tasks incrementally while alleviating forgetting. Specifically, LSTR assigns each task a set of task-specific expert modules at different stages of an STR model, while the remaining parameters are shared among tasks. These shared parameters are learned only on the first task and remain unchanged during subsequent learning, so that previously acquired knowledge is not overwritten. Moreover, in real applications there is no prior knowledge of which task an input image belongs to, making it impossible to directly select the corresponding expert modules. To this end, we propose the incremental task prediction network (ITPN), which identifies the most related task category by pulling the features of the same task closer and pushing those of different tasks farther apart. To validate the proposed method in our newly introduced IL setting, we collected a large-scale dataset consisting of both real and synthetic multilingual STR data. Extensive experiments on this dataset clearly show the superiority of LSTR over state-of-the-art IL methods.
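The two mechanisms summarized above lend themselves to a compact illustration. Below is a minimal PyTorch sketch, not the paper's implementation: a shared encoder that is trained only on the first task and frozen afterwards, lightweight per-task expert modules with per-task heads, and a simple margin-based contrastive loss that pulls features of the same task together and pushes different tasks apart, standing in for the ITPN objective. All module names (ExpertAdapter, LifelongRecognizer, task_contrastive_loss), layer sizes, and the exact loss form are illustrative assumptions rather than details taken from the paper.

```python
# A minimal sketch (assumptions, not the authors' code) of expert modules on a
# frozen shared backbone plus a contrastive task-prediction objective.

import torch
import torch.nn as nn
import torch.nn.functional as F


class ExpertAdapter(nn.Module):
    """A small task-specific module inserted at one stage of the recognizer."""

    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, hidden)
        self.up = nn.Linear(hidden, dim)

    def forward(self, x):
        # Residual bottleneck: shared feature plus a task-specific correction.
        return x + self.up(F.relu(self.down(x)))


class LifelongRecognizer(nn.Module):
    """Shared encoder (frozen after task 0) plus one expert and head per task."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.shared_encoder = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim)
        )
        self.experts = nn.ModuleList()   # one ExpertAdapter per task
        self.heads = nn.ModuleList()     # one classifier per task (per-task charset)

    def add_task(self, dim: int, num_classes: int):
        self.experts.append(ExpertAdapter(dim))
        self.heads.append(nn.Linear(dim, num_classes))
        if len(self.experts) > 1:
            # Shared parameters are learned only on the first task; freezing
            # them afterwards prevents previously acquired knowledge from
            # being overwritten when new tasks arrive.
            for p in self.shared_encoder.parameters():
                p.requires_grad = False

    def forward(self, x, task_id: int):
        feat = self.shared_encoder(x)
        feat = self.experts[task_id](feat)
        return self.heads[task_id](feat)


def task_contrastive_loss(features, task_labels, margin: float = 1.0):
    """Contrastive surrogate for the task-prediction objective: same-task
    pairs are pulled together, different-task pairs pushed beyond a margin."""
    dist = torch.cdist(features, features)                  # (N, N) pairwise distances
    same = task_labels.unsqueeze(0) == task_labels.unsqueeze(1)
    diag = torch.eye(len(features), dtype=torch.bool, device=features.device)
    pos = dist[same & ~diag].mean()                         # attract same-task pairs
    neg = F.relu(margin - dist[~same]).mean()               # repel different-task pairs
    return pos + neg


if __name__ == "__main__":
    model = LifelongRecognizer(dim=256)
    model.add_task(dim=256, num_classes=100)  # task 0: shared encoder still trainable
    model.add_task(dim=256, num_classes=80)   # task 1: shared encoder now frozen
    x = torch.randn(8, 256)                   # stand-in for pooled image features
    logits = model(x, task_id=1)
    loss = task_contrastive_loss(x, torch.tensor([0, 0, 0, 0, 1, 1, 1, 1]))
    print(logits.shape, loss.item())
```

In this sketch, freezing the shared encoder after the first call to add_task mirrors the claim that shared parameters are learned only on the first task; the margin-based loss is just one plausible way to realize the "pull closer, push apart" behavior described for the ITPN.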


Cited By

  • (2024) Hierarchical Multi-label Learning for Incremental Multilingual Text Recognition. In Proceedings of the 32nd ACM International Conference on Multimedia, 8750-8758. https://doi.org/10.1145/3664647.3681350 (online publication date: 28 October 2024)


Published In

MM '23: Proceedings of the 31st ACM International Conference on Multimedia
October 2023
9913 pages
ISBN:9798400701085
DOI:10.1145/3581783


Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. incremental learning
  2. language identification
  3. scene text recognition

Qualifiers

  • Research-article


Conference

MM '23
MM '23: The 31st ACM International Conference on Multimedia
October 29 - November 3, 2023
Ottawa, ON, Canada
