Abstract
Scene Text Recognition (STR) has many important applications in computer vision. Complex backgrounds remain a major challenge for STR because they interfere with text feature extraction. Many existing methods use attentional regions, bounding boxes, or polygons to reduce such interference. However, the text regions located by these methods still contain substantial background interference. In this paper, we propose a Background-Insensitive approach, BINet, that explicitly leverages text Semantic Segmentation (SSN) to extract text more accurately. SSN is trained on a set of existing segmentation data whose volume is only 0.03% of the STR training data. This avoids the need for large-scale pixel-level annotation of the STR training data. To effectively utilize the segmentation cues, we design new segmentation refinement and embedding blocks for refining text masks and reinforcing visual features. Additionally, we propose an efficient pipeline that utilizes Synthetic Initialization (SI) for STR models trained only on real data (1.7% of STR training data), instead of on both synthetic and real data from scratch. Experiments show that the proposed method recognizes text against complex backgrounds more effectively, achieving state-of-the-art performance on several public datasets.
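The general idea of using a text mask to suppress background interference in visual features can be illustrated with a minimal sketch. This is not BINet's refinement or embedding block, whose details the abstract does not give; the function name `reinforce_with_mask` and the `background_weight` parameter are assumptions introduced purely for illustration.

```python
from typing import List

Grid = List[List[float]]


def reinforce_with_mask(features: Grid, mask: Grid,
                        background_weight: float = 0.0) -> Grid:
    """Scale each feature value by its text probability.

    Hypothetical helper (not from the paper): a position with mask value m
    keeps the fraction background_weight + (1 - background_weight) * m of
    its feature, so text positions (m = 1) are kept in full and background
    positions (m = 0) are attenuated to background_weight.
    """
    same_shape = len(features) == len(mask) and all(
        len(f_row) == len(m_row) for f_row, m_row in zip(features, mask)
    )
    if not same_shape:
        raise ValueError("features and mask must have the same shape")

    out = []
    for f_row, m_row in zip(features, mask):
        out.append([
            f * (background_weight + (1.0 - background_weight) * m)
            for f, m in zip(f_row, m_row)
        ])
    return out


# Toy 2x2 feature map and soft text mask (1.0 = text, 0.0 = background).
features = [[2.0, 4.0], [6.0, 8.0]]
mask = [[1.0, 0.0], [0.5, 1.0]]
masked = reinforce_with_mask(features, mask)
```

In this toy case the background position is zeroed while the text positions keep their values; a nonzero `background_weight` would retain some background context instead of discarding it entirely.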
Acknowledgment
This work is supported by the XSEDE Program of the National Science Foundation and the Aspire-II Research Program at the University of South Carolina. This work used GPUs provided by the NSF MRI-2018966.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Zhao, L., Wu, Z., Wu, X., Wilsbacher, G., Wang, S. (2022). Background-Insensitive Scene Text Recognition with Text Semantic Segmentation. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13685. Springer, Cham. https://doi.org/10.1007/978-3-031-19806-9_10
DOI: https://doi.org/10.1007/978-3-031-19806-9_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-19805-2
Online ISBN: 978-3-031-19806-9
eBook Packages: Computer Science (R0)