Realtime multi-scale scene text detection with scale-based region proposal network
Introduction
Scene text detection has been widely studied in the computer vision community, as it is one of the fundamental steps of scene understanding. Owing to the significant achievements of deep convolutional neural network (CNN) [1] based generic object detection methods such as Faster R-CNN [2], SSD [3] and FPN [4], scene text detection methods [5], [6], [7], [8], [9], [10] have surpassed traditional approaches by large margins. Besides reaching high detection performance, speeding up the detection framework is also necessary for practical uses such as detecting text in videos [11], [12]. However, many previous works on fast text detection are based either on manual features [13], [14] or on image-level text/non-text classification [15], and CNN-based word/line-level scene text detectors remain inefficient in dealing with texts of various scales.
Although methods such as EAST [5], SegLink [6] and TextBoxes [8] report nearly realtime frame rates, these results are based on single-scale testing. To achieve better performance, multi-scale pyramid input is still inevitable, which makes the whole system computationally expensive. Previous works such as EAST, TextBoxes and DDRN [7] gain evident improvements under multi-scale testing but consume much more time. Therefore, reducing the running time of multi-scale scene text detection is crucial for high-performance realtime applications.
Our key insight for speeding up multi-scale testing is to avoid feeding a multi-scale pyramid input in order to detect scene text of various scales. To realize this, we could adopt a text detector able to localize text regions over a wider scale range. However, according to the ablation experiments in Section 4.4.2, detecting a wider text scale range sacrifices detection performance, since it increases the diversity of text features. To overcome this, we propose a novel two-stage scene text detection method. A visual comparison between multi-scale pyramid testing and the proposed method is shown in Fig. 1.
In our method, the first stage uses a novel light-weight network, named Scale-based Region Proposal Network (SRPN), to efficiently but coarsely localize text regions over a wide scale range and to estimate the text scale as well. SRPN is based on the Fully Convolutional Network (FCN) [16] and predicts down-sampled pixel-level information of the input image through three tasks. The first task is text/non-text classification, and the second is text scale estimation. In this work, we estimate text scale via classification, dividing text scale into three categories, Small, Normal and Big, based on the shortest side length of a text bounding box, since the shortest side is close to the character scale. Text scale is estimated by classification instead of regression because a classification task is easier to train, and precise scale values are unnecessary owing to the scale tolerance of the text detector. The third, auxiliary, task is to determine text boundaries in order to facilitate text region proposal generation. Since SRPN is a light-weight network (2.7 MB in this work), the forward pass only takes about 7.8 ms for a 720p (1280 × 720) image. Based on SRPN, most non-text regions can be filtered out, and text region proposals are then cropped out and properly resized.
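The scale-bucketing rule described above can be sketched as follows. This is a minimal illustration: the pixel thresholds (16 and 48) are our own assumptions, since the paper only states that scale is split into Small, Normal and Big by the shortest side length of the text bounding box.

```python
def scale_category(box, small_max=16.0, big_min=48.0):
    """Assign a text bounding box to a scale class by its shortest side.

    box: (x0, y0, x1, y1) axis-aligned text bounding box.
    small_max, big_min: illustrative pixel thresholds (assumed values);
    the paper does not publish the exact cut-offs in this excerpt.
    """
    x0, y0, x1, y1 = box
    shortest = min(x1 - x0, y1 - y0)  # shortest side ~ character scale
    if shortest < small_max:
        return "Small"
    if shortest < big_min:
        return "Normal"
    return "Big"
```

Framing the task as three-way classification over such buckets, rather than regressing the exact side length, matches the paper's observation that the second-stage detector tolerates moderate scale error.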
The second stage employs a detector covering a narrow scale range to predict the text bounding boxes within the text region proposals from the first stage. Similar to SRPN, the scene text detector used in this stage is also derived from FCN, by removing SRPN's scale estimation task. However, in contrast to the light-weight structure of SRPN, a deeper network with more convolutional kernels is adopted in this stage to achieve better performance.
To evaluate the proposed method, we choose the benchmarks ICDAR2015 (Incidental Scene Text) [17], ICDAR2013 (Focused Scene Text) and MSRA-TD500 [18], where text scales vary over a large range and scenes are complex.
In our experiments, we test two text detectors in the second stage to verify the speedup of the two-stage framework. The first detector is fine-tuned from VGG-16 [19], and the second follows the design of SRPN by keeping the same feature extractor structure. On ICDAR2015, the F-measure scores of the two detectors are 85.40% and 79.66%, respectively. The VGG-16 based model achieves state-of-the-art performance at 16.5 fps, which is 5 times faster than the previous best result in [20]. The second model also yields competitive performance at 35.1 fps, which readily satisfies realtime requirements. On ICDAR2013 and MSRA-TD500, we achieve competitive performance of 88.17% and 80.72%, at high speeds of 20.9 fps and 14.6 fps, respectively.
Moreover, we provide ablation studies that analyze the proposed method from three aspects. The first experiment demonstrates that the two-stage method yields similar or even higher performance compared with the traditional multi-scale pyramid based method. The second experiment shows that using a single network to detect text over a large scale range yields evidently lower detection performance, which justifies the two-stage framework. The third experiment validates the necessity of the scale estimation task and the auxiliary boundary regression task in SRPN.
In summary, the main contributions of this paper are four-fold:
- 1. We propose a novel network named SRPN to realize both text/non-text localization and scale estimation efficiently;
- 2. A two-stage detection scheme based on SRPN is proposed to avoid multi-scale pyramid processing and achieve faster detection speed;
- 3. The proposed method achieves remarkable speedup on ICDAR2015, ICDAR2013 and MSRA-TD500 while keeping competitive performance;
- 4. Ablation experiments are conducted to justify the proposed method from different aspects.
The rest of this paper is organized as follows: Section 2 gives a brief review of recent works on scene text detection, together with works on scale estimation and network acceleration; Section 3 describes details of the proposed method; Section 4 provides experimental results and ablation studies; Section 5 concludes this paper.
Related work
Early works on scene text detection were usually based on manually designed features such as Maximally Stable Extremal Regions (MSER) [21], [22], Stroke Width Transform (SWT) [23], EdgeBox [24] or other types [25], [26]. Even though the works in [27], [28] adopt deep neural networks, the base features are still based on MSER. Deep learning based methods such as [5], [6], [7], [8], [9], [10], [29], [30] can learn discriminative features in an end-to-end manner, and have significantly promoted the
The proposed method
As shown in Fig. 2, our method first uses SRPN to predict text regions and their corresponding scales. Based on the predictions of SRPN, text region proposals are cropped out and resized to proper scales. After that, an FCN-based text detector localizes each word within the text region proposals. Finally, we map the text bounding boxes in the region proposals back to the original image to obtain the final result.
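The final mapping step can be sketched as follows. This is a hypothetical helper under the assumption that each proposal is an axis-aligned crop that was uniformly rescaled before detection; the names `crop_origin` and `resize_scale` are our own, as the paper describes the mapping but not an API for it.

```python
def map_box_to_image(box, crop_origin, resize_scale):
    """Map a box detected in a resized proposal crop back to the original image.

    box: (x0, y0, x1, y1) in the coordinates of the resized crop.
    crop_origin: (ox, oy), top-left corner of the crop in the original image.
    resize_scale: factor by which the crop was enlarged before detection.
    """
    ox, oy = crop_origin
    x0, y0, x1, y1 = box
    inv = 1.0 / resize_scale  # undo the resize, then undo the crop offset
    return (x0 * inv + ox, y0 * inv + oy, x1 * inv + ox, y1 * inv + oy)
```

For example, a box (20, 10, 60, 30) found in a crop taken at (100, 200) and upscaled by 2× maps back to (110, 205, 130, 215) in the original image.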
Experiments
Experiments are conducted on various datasets containing both focused and incidental texts.
Conclusion
In this paper, we propose a novel two-stage method to achieve realtime multi-scale scene text detection. The first stage uses a novel light-weight network called Scale-based Region Proposal Network (SRPN) to quickly localize text region proposals and estimate their text scales. SRPN can localize text regions over a wide scale range fast but imprecisely. Based on the text scale estimation, text region proposals are properly resized to a unified normal scale, and then the text detector in the second stage only
Acknowledgments
This work has been supported by the National Natural Science Foundation of China (NSFC) Grants 61721004, 61411136002, 61733007 and 61633021, and the NVIDIA NVAIL program.
Wenhao He received the BS degree in communication engineering from Beihang University, Beijing, China in 2013, and PhD degree in pattern recognition and intelligent systems from Institute of Automation, Chinese Academy of Sciences, Beijing, China, in 2019. He was a visiting researcher at University of La Rochelle, in 2017. He is now a senior researcher at Tencent. His research interests include pattern recognition, deep learning, computer vision and scene text detection.
References (51)
- et al., Fractals based multi-oriented text detection system for recognition in mobile video images, Pattern Recognit., 2017.
- et al., Text information extraction in images and video: a survey, Pattern Recognit., 2004.
- et al., Text/non-text image classification in the wild with convolutional neural networks, Pattern Recognit., 2017.
- et al., A robust approach for text detection from natural scene images, Pattern Recognit., 2015.
- et al., LightweightNet: toward fast and lightweight convolutional neural networks via architecture distillation, Pattern Recognit., 2019.
- et al., ImageNet classification with deep convolutional neural networks, Proc. Neural Information Processing Systems, 2012.
- et al., Faster R-CNN: towards real-time object detection with region proposal networks, Proc. Neural Information Processing Systems, 2015.
- et al., SSD: single shot multibox detector, Proc. European Conference on Computer Vision, 2016.
- et al., Feature pyramid networks for object detection, Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2017.
- et al., EAST: an efficient and accurate scene text detector, Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2017.
- Detecting oriented text in natural images by linking segments, Proc. IEEE Conf. Computer Vision and Pattern Recognition.
- Deep direct regression for multi-oriented scene text detection, Proc. IEEE International Conference on Computer Vision.
- TextBoxes: a fast text detector with a single deep neural network, Proc. AAAI Conference on Artificial Intelligence.
- Deep matching prior network: toward tighter multi-oriented text detection, Proc. IEEE Conf. Computer Vision and Pattern Recognition.
- Detecting text in natural image with connectionist text proposal network, Proc. European Conference on Computer Vision.
- Real-time lexicon-free scene text localization and recognition, IEEE Trans. Pattern Anal. Mach. Intell.
- Real time image enhancement and segmentation for sign/text detection, Proc. International Conference on Image Processing.
- Fully convolutional networks for semantic segmentation, Proc. IEEE Conf. Computer Vision and Pattern Recognition.
- ICDAR 2015 Competition on Robust Reading, Proc. International Conference on Document Analysis and Recognition.
- Detecting texts of arbitrary orientations in natural images, Proc. IEEE Conf. Computer Vision and Pattern Recognition.
- Very deep convolutional networks for large-scale image recognition, Proc. International Conference on Learning Representations.
- PixelLink: detecting scene text via instance segmentation, Proc. AAAI Conference on Artificial Intelligence.
- A novel text structure feature extractor for Chinese scene text detection and recognition, IEEE Access.
- Detecting text in natural scenes with stroke width transform, Proc. IEEE Conf. Computer Vision and Pattern Recognition.
- Reading text in the wild with convolutional neural networks, Int. J. Comput. Vis.
Xu-Yao Zhang received BS degree in computational mathematics from Wuhan University, Wuhan, China, in 2008 and PhD degree in pattern recognition and intelligent systems from Institute of Automation, Chinese Academy of Sciences, Beijing, China, in 2013. He is currently an associate professor in National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences, Beijing, China. He was a visiting researcher at CENPARMI of Concordia University, in 2012. From March 2015 to March 2016, he was a visiting scholar in Montreal Institute for Learning Algorithms (MILA), University of Montreal, Canada. His research interests include machine learning, pattern recognition, handwriting recognition, and deep learning.
Fei Yin received the BS degree in computer science from Xidian University of Posts and Telecommunications, Xi’an, China, in 1999, the ME degree in pattern recognition and intelligent systems from Huazhong University of Science and Technology, Wuhan, China, in 2002, and the PhD degree in pattern recognition and intelligent systems from the Institute of Automation, Chinese Academy of Sciences, Beijing, China, in 2010. He is an associate professor in the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences. His research interests include document image analysis, handwritten character recognition, and image processing. He has published more than 50 papers at international journals and conferences.
Zhenbo Luo received the Master degree in Electronics Engineering from Tsinghua University in 2006, supervised by Professor Xiaoqing Ding, and the B.Sc. degree in Electronics Engineering from Fudan University in 2003. He is a Principal Engineer and leader of the Machine Learning Lab of Samsung R&D Institute China. His research interests include vision and learning. He leads the team developing GAN, text recognition, AR and human understanding technologies for Samsung's smartphone, visual display and printing businesses.
Jean-Marc Ogier received his PhD degree in computer science from the University of Rouen, France, in 1994. During this period (1991–1994), he worked on graphic recognition for the Matra Ms&I Company. Now a full professor at the University of La Rochelle, Pr. Ogier was the head of the L3I laboratory in computer science of the University of La Rochelle, which gathers more than 120 members and works mainly on Document Analysis and Content Management. Author of more than 200 publications/communications, he has managed several French and European projects dealing with document analysis, with both public institutions and private companies. Pr. Ogier was Deputy Director of the GDR I3 of the French National Research Centre (CNRS) between 2005 and 2013. He was also Chair of the Technical Committee 10 (Graphic Recognition) of the International Association for Pattern Recognition (IAPR) from 2010 to 2015, and is the representative member of France on the governing board of the IAPR. He is now the general chair of TC6 of the IAPR, dealing with computational forensics. Jean-Marc Ogier has been the general chair or the program chair of several international scientific events dealing with document analysis (DAS, ICDAR, GREC, ...). He is now the president of the University of La Rochelle.
Cheng-Lin Liu is a Professor at the National Laboratory of Pattern Recognition (NLPR), Institute of Automation of Chinese Academy of Sciences, Beijing, China, and is now the director of the laboratory. He received the B.S. degree in electronic engineering from Wuhan University, Wuhan, China, the M.E. degree in electronic engineering from Beijing Polytechnic University, Beijing, China, the Ph.D. degree in pattern recognition and intelligent control from the Institute of Automation of Chinese Academy of Sciences, Beijing, China, in 1989, 1992 and 1995, respectively. He was a postdoctoral fellow at Korea Advanced Institute of Science and Technology (KAIST) and later at Tokyo University of Agriculture and Technology from March 1996 to March 1999. From 1999 to 2004, he was a research staff member and later a senior researcher at the Central Research Laboratory, Hitachi, Ltd., Tokyo, Japan. His research interests include pattern recognition, image processing, neural networks, machine learning, and especially the applications to character recognition and document analysis. He has published over 200 technical papers at prestigious international journals and conferences. He is an associate editor-in-chief of Pattern Recognition, and is on the editorial board of Image and Vision Computing, International Journal on Document Analysis and Recognition, and Cognitive Computation. He is a fellow of the IEEE and the IAPR.