
Pattern Recognition

Volume 98, February 2020, 107026

Realtime multi-scale scene text detection with scale-based region proposal network

https://doi.org/10.1016/j.patcog.2019.107026

Highlights

  • We propose a novel network named SRPN to realize both text/non-text localization and scale estimation efficiently.

  • A two-stage detection scheme based on SRPN is proposed to avoid using multi-scale pyramid input and achieve faster detection speed.

  • The proposed method achieves remarkable speedup on ICDAR2015, ICDAR2013 and MSRA-TD500 while keeping competitive performance.

  • Ablation experiments are given to demonstrate the reasonableness of the proposed method from different aspects.

Abstract

Multi-scale approaches are widely used to achieve high accuracy in scene text detection, but they usually slow down the whole system. In this paper, we propose a two-stage framework for realtime multi-scale scene text detection. The first stage employs a novel Scale-based Region Proposal Network (SRPN), which efficiently localizes text over a wide scale range and estimates text scale. Based on SRPN, non-text regions are filtered out and text region proposals are generated. Moreover, guided by the scale estimated by SRPN, small or big texts in region proposals are resized into a unified normal scale range. The second stage then adopts a Fully Convolutional Network based scene text detector to localize text words within the proposals from the first stage. This detector handles only a narrow scale range, but does so accurately. Since SRPN efficiently eliminates most non-text regions, and texts in proposals are properly scaled to avoid multi-scale pyramid processing, the whole system is quite fast. We evaluate both the performance and speed of the proposed method on the ICDAR2015, ICDAR2013, and MSRA-TD500 datasets. On ICDAR2015, our system reaches a state-of-the-art F-measure of 85.40% at 16.5 fps (frames per second), and a competitive 79.66% at 35.1 fps, either of which is more than 5 times faster than previous best methods. On ICDAR2013 and MSRA-TD500, we also achieve remarkable speedup while maintaining competitive performance. Ablation experiments are provided to demonstrate the reasonableness of our method.

Introduction

Scene text detection is widely studied in the computer vision community, as it is one of the fundamental steps of scene understanding. Owing to the significant achievements of deep convolutional neural network (CNN) [1] based generic object detection methods such as Faster-RCNN [2], SSD [3] and FPN [4], scene text detection methods [5], [6], [7], [8], [9], [10] have surpassed traditional approaches by large margins. Besides reaching high detection performance, speeding up the detection framework is also necessary for practical uses such as detecting text in videos [11], [12]. However, many previous works on fast text detection are either based on handcrafted features [13], [14] or limited to image-level text/non-text classification [15], and CNN-based word/line-level scene text detectors are still inefficient when dealing with texts of various scales.

Although methods such as EAST [5], Link-Seg [6] and TextBox [8] report nearly realtime frame rates, these results are based on single-scale testing. To achieve better performance, multi-scale pyramid input is still inevitable, which makes the whole system computationally expensive. Previous works such as EAST, TextBox and DDRN [7] gain evident improvement under multi-scale testing but consume much more time. Therefore, reducing the running time of multi-scale scene text detection is crucial for high-performance realtime applications.

Our key insight for speeding up multi-scale testing is to avoid multi-scale pyramid input when detecting scene text of various scales. To realize this, we could adopt a text detector able to localize text regions over a wider scale range. However, according to the ablation experiments in Section 4.4.2, detecting a wider range of text scales sacrifices detection performance, since it increases the diversity of text features. To overcome this, we propose a novel two-stage scene text detection method. A visual comparison between multi-scale pyramid testing and the proposed method is shown in Fig. 1.

In our method, the first stage uses a novel light-weight network, named Scale-based Region Proposal Network (SRPN), to efficiently, if roughly, localize text regions over a wide scale range and estimate their scales. The SRPN is based on a Fully Convolutional Network (FCN) [16] and performs three tasks that predict down-sampled pixel-level information about the input image from different aspects. The first task is text/non-text classification, and the second is text scale estimation. In this work, we estimate text scale with a classification task that divides text scale into three categories, Small, Normal and Big, based on the shortest side length of a text bounding box, since the shortest side approximates the character scale. Text scale is estimated by classification rather than regression because a classification task is easier to train, and the estimated scale need not be precise owing to the scale tolerance of the text detector. The third, auxiliary task determines text boundaries to facilitate text region proposal generation. Since SRPN is a light-weight network (2.7 MB in this work), its forward pass takes only about 7.8 ms for a 720p (1280 × 720) image. Based on SRPN, most non-text regions can be filtered out, and text region proposals are then cropped out and properly resized.
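The shortest-side-based scale categorization can be sketched as follows. The pixel thresholds of 24 and 64 are hypothetical values chosen for illustration; the excerpt does not state the exact boundaries between the Small, Normal and Big categories.

```python
# Sketch of SRPN-style scale categorization by shortest box side.
# The thresholds small_max=24 and big_min=64 pixels are assumptions
# for illustration; the paper's exact category boundaries are not
# given in this excerpt.
def scale_category(box_w, box_h, small_max=24, big_min=64):
    """Classify a text box as 'small', 'normal' or 'big'."""
    shortest = min(box_w, box_h)  # shortest side approximates character scale
    if shortest < small_max:
        return "small"
    if shortest > big_min:
        return "big"
    return "normal"
```

Using the shortest side rather than area or the longest side keeps the category aligned with character height, so a long but thin text line is still treated as small text.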

The second stage employs a detector that covers a narrow scale range to predict the text bounding boxes within the proposals from the first stage. Like the SRPN, the scene text detector used in this stage is also derived from an FCN, obtained by removing SRPN's scale estimation task. However, unlike the light-weight SRPN, a deeper network with more convolutional kernels is adopted in the second stage to achieve better performance.

To evaluate the performance of the proposed method, we choose the benchmarks ICDAR2015 (Incidental Scene Text) [17], ICDAR2013 (Focused Scene Text) and MSRA-TD500 [18], where text varies over a large scale range and the scenes are complex.

In our experiments, we test two text detectors in the second stage to verify the speedup of the two-stage framework. The first detector is fine-tuned from VGG-16 [19], and the second follows the design of SRPN, keeping the same feature extractor structure. On ICDAR2015, the F-measure scores given by the two detectors are 85.40% and 79.66%, respectively. The VGG-16 based model achieves state-of-the-art performance at 16.5 fps, which is 5 times faster than the previous best results in [20]. The second model yields competitive performance at 35.1 fps, which clearly satisfies realtime requirements. On ICDAR2013 and MSRA-TD500, we achieve competitive performance of 88.17% and 80.72%, at high speeds of 20.9 fps and 14.6 fps, respectively.

Moreover, we provide ablation studies that analyze the proposed method from three aspects. The first experiment demonstrates that the two-stage method yields similar or even higher performance than the traditional multi-scale pyramid based method. The second shows that using a single network to detect text over a large scale range yields evidently lower detection performance, justifying the two-stage framework. The third validates the necessity of the scale estimation task and the auxiliary boundary regression task in SRPN.

In summary, the main contributions of this paper are fourfold:

  • 1.

    We propose a novel network named SRPN to realize both text/non-text localization and scale estimation efficiently;

  • 2.

    A two-stage detection scheme based on SRPN is proposed to avoid using multi-scale pyramid processing and achieve faster detection speed;

  • 3.

    The proposed method achieves remarkable speedup on ICDAR2015, ICDAR2013 and MSRA-TD500 while keeping competitive performance;

  • 4.

    Ablation experiments are conducted to justify the reasonableness of the proposed method from different aspects.

The rest of this paper is organized as follows: Section 2 gives a brief review of recent works on scene text detection, together with works on scale estimation and network acceleration; Section 3 describes details of the proposed method; Section 4 provides experimental results and ablation studies; Section 5 concludes this paper.

Section snippets

Related work

Early works on scene text detection were usually based on manually designed features such as Maximally Stable Extremal Regions (MSER) [21], [22], Stroke Width Transform (SWT) [23], EdgeBox [24] or others [25], [26]. Even though the works in [27], [28] adopt deep neural networks, the base features are still based on MSER. Deep learning based methods such as [5], [6], [7], [8], [9], [10], [29], [30] can learn discriminative features in an end-to-end manner, and have significantly promoted the

The proposed method

As shown in Fig. 2, our method first uses an SRPN to predict text regions and their corresponding scales. Based on the SRPN predictions, text region proposals are cropped out and resized to proper scales. After that, an FCN based text detector localizes each word in the text region proposals. Finally, we map the text bounding boxes in the region proposals back to the original image to obtain the final result.
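The pipeline above can be summarized at the bounding-box level with a short sketch. Everything here is hypothetical glue code: `srpn` and `detector` stand in for the two networks, `crop_and_resize` is a placeholder for a real image crop-and-resize (e.g. via OpenCV), and the target short side of 48 px is an assumed value for the unified normal scale.

```python
def crop_and_resize(image, box, factor):
    """Placeholder: a real implementation would crop `box` from `image`
    and rescale it by `factor` (e.g. with cv2.resize)."""
    return image

def detect_text(image, srpn, detector, target_short_side=48.0):
    results = []
    # Stage 1: SRPN yields coarse proposals tagged with a scale category.
    for (x, y, w, h, cat) in srpn(image):
        # Resize Small/Big proposals toward the unified normal scale.
        factor = 1.0 if cat == "normal" else target_short_side / min(w, h)
        crop = crop_and_resize(image, (x, y, w, h), factor)
        # Stage 2: a narrow-scale-range FCN detector finds word boxes.
        for (bx, by, bw, bh) in detector(crop):
            # Map word boxes from crop coordinates back to the image.
            results.append((x + bx / factor, y + by / factor,
                            bw / factor, bh / factor))
    return results
```

The back-mapping step is just the inverse of the crop-and-resize transform: divide the detected coordinates by the scale factor and offset by the proposal's origin.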

Experiments

Experiments are conducted on various datasets containing both focused and incidental texts.

Conclusion

In this paper, we propose a novel two-stage method to achieve realtime multi-scale scene text detection. The first stage uses a novel light-weight network called Scale-based Region Proposal Network (SRPN) to quickly localize text region proposals with estimated text scales. The SRPN can localize text regions over a wide scale range quickly but imprecisely. Based on the text scale estimation, text region proposals are properly resized to a unified normal scale, and then the text detector in the second stage only

Acknowledgments

This work has been supported by the National Natural Science Foundation of China (NSFC) Grants 61721004, 61411136002, 61733007, 61633021 and the NVIDIA NVAIL program.

Wenhao He received the BS degree in communication engineering from Beihang University, Beijing, China in 2013, and PhD degree in pattern recognition and intelligent systems from Institute of Automation, Chinese Academy of Sciences, Beijing, China, in 2019. He was a visiting researcher at University of La Rochelle, in 2017. He is now a senior researcher at Tencent. His research interests include pattern recognition, deep learning, computer vision and scene text detection.

References (51)

  • B. Shi et al., Detecting oriented text in natural images by linking segments, Proc. IEEE Conf. Computer Vision and Pattern Recognition (2017).
  • W. He et al., Deep direct regression for multi-oriented scene text detection, Proc. IEEE International Conference on Computer Vision (2017).
  • M. Liao et al., TextBoxes: a fast text detector with a single deep neural network, Proc. AAAI Conference on Artificial Intelligence (2017).
  • Y. Liu et al., Deep matching prior network: toward tighter multi-oriented text detection, Proc. IEEE Conf. Computer Vision and Pattern Recognition (2017).
  • Z. Tian et al., Detecting text in natural image with connectionist text proposal network, Proc. European Conference on Computer Vision (2016).
  • L. Neumann et al., Real-time lexicon-free scene text localization and recognition, IEEE Trans. Pattern Anal. Mach. Intell. (2016).
  • E.D. Haritaoglu et al., Real time image enhancement and segmentation for sign/text detection, Proc. International Conference on Image Processing (2003).
  • J. Long et al., Fully convolutional networks for semantic segmentation, Proc. IEEE Conf. Computer Vision and Pattern Recognition (2015).
  • D. Karatzas et al., ICDAR 2015 competition on robust reading, Proc. International Conference on Document Analysis and Recognition (2015).
  • C. Yao et al., Detecting texts of arbitrary orientations in natural images, Proc. IEEE Conf. Computer Vision and Pattern Recognition (2012).
  • K. Simonyan et al., Very deep convolutional networks for large-scale image recognition, Proc. International Conference on Learning Representations (2015).
  • D. Deng et al., PixelLink: detecting scene text via instance segmentation, Proc. AAAI Conference on Artificial Intelligence (2018).
  • X. Ren et al., A novel text structure feature extractor for Chinese scene text detection and recognition, IEEE Access (2017).
  • B. Epshtein et al., Detecting text in natural scenes with stroke width transform, Proc. IEEE Conf. Computer Vision and Pattern Recognition (2010).
  • M. Jaderberg et al., Reading text in the wild with convolutional neural networks, Int. J. Comput. Vis. (2016).

Xu-Yao Zhang received the BS degree in computational mathematics from Wuhan University, Wuhan, China, in 2008 and the PhD degree in pattern recognition and intelligent systems from the Institute of Automation, Chinese Academy of Sciences, Beijing, China, in 2013. He is currently an associate professor in the National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences, Beijing, China. He was a visiting researcher at CENPARMI of Concordia University in 2012. From March 2015 to March 2016, he was a visiting scholar at the Montreal Institute for Learning Algorithms (MILA), University of Montreal, Canada. His research interests include machine learning, pattern recognition, handwriting recognition, and deep learning.

Fei Yin received the BS degree in computer science from Xidian University of Posts and Telecommunications, Xi’an, China, in 1999, the ME degree in pattern recognition and intelligent systems from Huazhong University of Science and Technology, Wuhan, China, in 2002, and the PhD degree in pattern recognition and intelligent systems from the Institute of Automation, Chinese Academy of Sciences, Beijing, China, in 2010. He is an associate professor in the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences. His research interests include document image analysis, handwritten character recognition, and image processing. He has published more than 50 papers in international journals and conferences.

Zhenbo Luo received the BSc degree in electronics engineering from Fudan University in 2003 and the Master's degree in electronics engineering from Tsinghua University in 2006, supervised by Professor Xiaoqing Ding. He is a Principal Engineer and leader of the Machine Learning Lab of Samsung R&D Institute China. His research interests include vision and learning. He leads a team developing GAN, text recognition, AR, and human understanding technologies for Samsung smart phones, visual display and printing business.

Jean-Marc Ogier received his PhD degree in computer science from the University of Rouen, France, in 1994. During this period (1991–1994), he worked on graphic recognition for the Matra Ms&I Company. Now a full professor at the University of La Rochelle, Prof. Ogier was the head of the L3I laboratory in computer science at the University of La Rochelle, which gathers more than 120 members and works mainly on document analysis and content management. Author of more than 200 publications/communications, he has managed several French and European projects dealing with document analysis, with both public institutions and private companies. Prof. Ogier was Deputy Director of the GDR I3 of the French National Research Centre (CNRS) between 2005 and 2013. He was also Chair of the Technical Committee 10 (Graphic Recognition) of the International Association for Pattern Recognition (IAPR) from 2010 to 2015, and is the representative member of France on the governing board of the IAPR. He is now the general chair of TC6 of the IAPR, dealing with computational forensics. Jean-Marc Ogier has been the general chair or program chair of several international scientific events dealing with document analysis (DAS, ICDAR, GREC, etc.). He is now the president of the University of La Rochelle.

Cheng-Lin Liu is a Professor at the National Laboratory of Pattern Recognition (NLPR), Institute of Automation of Chinese Academy of Sciences, Beijing, China, and is now the director of the laboratory. He received the BS degree in electronic engineering from Wuhan University, Wuhan, China, the ME degree in electronic engineering from Beijing Polytechnic University, Beijing, China, and the PhD degree in pattern recognition and intelligent control from the Institute of Automation of Chinese Academy of Sciences, Beijing, China, in 1989, 1992 and 1995, respectively. He was a postdoctoral fellow at the Korea Advanced Institute of Science and Technology (KAIST) and later at Tokyo University of Agriculture and Technology from March 1996 to March 1999. From 1999 to 2004, he was a research staff member and later a senior researcher at the Central Research Laboratory, Hitachi, Ltd., Tokyo, Japan. His research interests include pattern recognition, image processing, neural networks, machine learning, and especially their applications to character recognition and document analysis. He has published over 200 technical papers in prestigious international journals and conferences. He is an associate editor-in-chief of Pattern Recognition, and is on the editorial boards of Image and Vision Computing, International Journal on Document Analysis and Recognition, and Cognitive Computation. He is a fellow of the IEEE and the IAPR.
