Arbitrarily shaped scene text detection with dynamic convolution
Introduction
Scene text detection has attracted increasing attention over the past decades. The rich semantic information carried by scene text has enabled applications such as navigation aids for the blind, autonomous driving, and product identification. Owing to recent progress in general object detection [1], [2], various methods have been proposed to detect multi-oriented scene text. However, most of these methods rely on the regression of rectangles or quadrangles to obtain text bounding boxes, which struggle to tightly enclose arbitrarily shaped text; as a result, much irrelevant image content, including background and other instances, is included in the bounding box. Detecting arbitrarily shaped text requires algorithms to predict the exact position and shape of each text instance in the target image while including as little background as possible. With the widespread adoption of fully convolutional networks (FCNs), segmentation-based text detection methods cast text detection as the prediction of pixel-level text instance masks, thereby promoting the detection of arbitrarily shaped text [3], [4]. However, FCN-based methods struggle to produce accurate predictions for neighboring instances with similar image appearances, and text instances in scene images are often quite close to each other and vary widely in scale and number. Consequently, many segmentation-based text detection methods require complicated post-processing steps to split and merge the instance information from the mask region and reconstruct the words or text lines. Effectively splitting adjacent text and retaining the detailed information of instances at diverse scales without post-processing therefore remains intractable and worth further exploration.
Here, in contrast to previous methods, we propose a novel arbitrarily-shaped text detector, named DText, which formulates an arbitrarily-shaped text detection problem based on dynamic convolution. Our motivations are based on the following observations:
- •
Many existing regression-based text detection methods are restricted by inflexible bounding boxes to predict arbitrarily-shaped text.
- •
Fixed, instance-shared convolutional kernels make FCN-based text segmentation methods struggle to predict individual masks for adjacent instances with similar appearances; complicated post-processing is thus required to distinguish them.
- •
Previous segmentation-based methods usually crop the region of interest (RoI) of each instance and then use a fixed head to predict masks for the multi-scale RoIs. Because the head has a fixed structure, the RoI resolutions must be unified, and it is clearly difficult to determine the optimal resolution for instances of different scales.
Therefore, inspired by [5], [6], we propose to dynamically generate convolutional kernels according to the particularities of each text instance. In contrast to [6], which aggregates a fixed set of kernels with varying attention weights and keeps those kernels fixed at inference, our method directly generates the kernels themselves, improving the representation capability of the network and the variability of the kernels, and maintaining example-specific kernels in the inference stage. Compared with CondInst [7], our method is more sensitive to the various shapes of text instances. The main contributions of this study are summarized as follows.
1) We dynamically generate convolutional kernels for different text instances from multiple features, according to their detailed characteristics. Specific attributes such as position, scale, and center are embedded into the convolutional kernel, so that mask prediction with the text-instance-aware kernel focuses on the pixels belonging to that instance.

2) We generate a separate mask prediction head for each instance, in parallel. These heads predict masks on the original feature map and retain the resolution details of each text instance, so it is unnecessary to crop RoIs and force them to a common size. Our architecture overcomes the problem that a fixed set of convolutional kernels cannot adapt to all resolutions, and simultaneously prevents the information loss caused by the multiple scales of the instances. This design also improves the detection accuracy of adjacent text instances, as their masks are predicted with different kernels in different channels.

3) Because the text-instance-aware convolutional kernels increase model capacity, we achieve competitive results with a very compact prediction head. Multiple mask prediction heads can therefore be run concurrently without significant computational overhead.

4) To improve performance and accelerate training convergence, we design a text-shape-sensitive position embedding that provides explicit location information to the mask prediction head.
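The core mechanism behind contributions 1)–3) can be illustrated with a minimal sketch. This is not the paper's implementation: the controller below is a random linear map standing in for a learned layer, the head is two 1×1 convolutions (which reduce to per-pixel matrix products), and all shapes and embedding sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

C, H, W = 8, 16, 16                      # shared feature map: channels, height, width
EMB = 16                                 # per-instance embedding size (assumed)
features = rng.standard_normal((C, H, W))

# Parameters of a tiny two-layer 1x1-conv mask head: conv1 (C->4) + bias, conv2 (4->1) + bias.
N_PARAMS = C * 4 + 4 + 4 + 1
W_ctrl = rng.standard_normal((N_PARAMS, EMB))  # stand-in for a learned controller

def controller(instance_embedding):
    """Map an instance embedding to the flattened parameters of its own mask head."""
    return W_ctrl @ instance_embedding

def dynamic_mask_head(params, feat):
    """Apply an instance-specific head on the full-resolution feature map
    (no RoI cropping or resizing), producing per-pixel mask logits."""
    w1 = params[:C * 4].reshape(4, C)
    b1 = params[C * 4:C * 4 + 4]
    w2 = params[C * 4 + 4:C * 4 + 8].reshape(1, 4)
    b2 = params[C * 4 + 8]
    x = feat.reshape(C, -1)                         # 1x1 conv == matmul over channels
    h = np.maximum(w1 @ x + b1[:, None], 0)         # ReLU
    return (w2 @ h + b2).reshape(H, W)

# Two nearby instances get two different kernel sets, hence separate masks,
# even though both heads read the same shared feature map.
emb_a, emb_b = rng.standard_normal(EMB), rng.standard_normal(EMB)
mask_a = dynamic_mask_head(controller(emb_a), features)
mask_b = dynamic_mask_head(controller(emb_b), features)
```

Because the heads are this compact, running one per instance in parallel adds little computation, which mirrors the paper's argument for a very compact prediction head.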
DText demonstrates that dynamic convolution can embed high-level semantic information about a text instance's shape into the dynamic kernel to improve model performance. An example of a DText prediction is shown in Fig. 1. Experiments conducted on five challenging benchmark datasets, including ICDAR 2015 [8], MLT [9], CTW1500 [10], Total-Text [11], and MSRA-TD500 [12], demonstrated the effectiveness of the proposed DText.
Section snippets
Related work
Traditional scene text detection methods mainly extract low- or mid-level handcrafted image features and require complex pre-processing and post-processing steps. Examples include the connected components analysis (CCA) based methods, sliding window (SW) based methods, and handcrafted image feature-based methods. With the emergence and development of deep learning, we have witnessed substantial advancements in arbitrarily shaped scene text detection in terms of both performance and robustness.
Methodology
DText integrates the strengths of the two basic prediction paradigms (regression and segmentation), which in practice is more robust to arbitrary shapes. Furthermore, we take the high-level features of multiple tasks in traditional object detection as input to dynamically generate the convolutional parameters required by the text mask prediction task; as a result, the convolutional parameters can be adjusted dynamically in the inference stage. In addition, these dynamically generated
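The contributions mention a text-shape-sensitive position embedding that gives the mask prediction head explicit location information. A common way to achieve this in dynamic-head designs (as in CondInst-style methods; the paper's exact embedding may differ) is to append per-instance relative-coordinate channels to the shared features before applying the instance-specific kernels. A minimal sketch, with all shapes and the instance center assumed for illustration:

```python
import numpy as np

def rel_coord_map(h, w, center_yx):
    """Normalized coordinate offsets of every pixel from an instance's center.
    Appended as two extra channels, these let an instance-specific kernel
    'know' where its instance sits on the shared feature map."""
    ys, xs = np.mgrid[0:h, 0:w]
    dy = (ys - center_yx[0]) / h          # normalize by map height
    dx = (xs - center_yx[1]) / w          # normalize by map width
    return np.stack([dy, dx])             # shape (2, h, w)

feat = np.zeros((8, 16, 16))              # illustrative shared feature map
coords = rel_coord_map(16, 16, center_yx=(4, 10))
conditioned = np.concatenate([feat, coords], axis=0)   # (10, 16, 16)
```

Each instance gets its own coordinate channels (centered on itself), so two instances with identical appearance still present different inputs to their dynamic heads, which helps split adjacent text.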
Experiments
To evaluate the effectiveness of our method, we first conduct ablation experiments on CTW1500 and then compare our method with previous state-of-the-art methods on both multi-oriented and arbitrarily shaped scene text benchmarks. Our method is implemented in PyTorch 1.8.1.
Conclusion
In this study, we successfully adopt dynamic convolution for scene text detection. This first attempt indicates the promising potential of the proposed DText: it improves the detection accuracy of arbitrarily shaped text by maintaining the integrity of each text instance and providing it a unique convolution kernel. In addition, it alleviates the adhesion problem of dense text. Exhaustive ablation studies demonstrate the effectiveness of the residual connection in dynamic
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work is supported by the China Scholarship Council (Grant No. 201908510079), Fundamental Research Funds for the Central Universities (Grant No. ZYN2022029), Guangdong Provincial Natural Science Foundation (Grant No. 2017A030312006), National Natural Science Foundation of China (Grant No. 61936003, 72174172, 71774134).
References (40)
- et al., Curved scene text detection via transverse and longitudinal sequence connection, Pattern Recognition (2019)
- et al., Rotated cascade R-CNN: a shape robust detector with coordinate regression, Pattern Recognition (2019)
- et al., Scale robust deep oriented-text detection network, Pattern Recognition (2020)
- et al., A quadrilateral scene text detector with two-stage network architecture, Pattern Recognition (2020)
- et al., SegLink++: detecting dense and arbitrary-shaped scene text by instance-aware component grouping, Pattern Recognition (2019)
- et al., Mask R-CNN, Proceedings of the IEEE International Conference on Computer Vision (2017)
- et al., FCOS: fully convolutional one-stage object detection, Proceedings of the IEEE International Conference on Computer Vision (2019)
- et al., TextField: learning a deep direction field for irregular scene text detection, IEEE Trans. Image Process. (2019)
- et al., Character region awareness for text detection, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2019)
- et al., SOLOv2: dynamic, faster and stronger, arXiv preprint arXiv:2003.10152 (2020)
- Dynamic convolution: attention over convolution kernels, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
- Conditional convolutions for instance segmentation, Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, Proceedings, Part I
- ICDAR 2015 competition on robust reading, 13th International Conference on Document Analysis and Recognition (ICDAR 2015)
- ICDAR2017 robust reading challenge on multi-lingual scene text detection and script identification (RRC-MLT), 14th IAPR International Conference on Document Analysis and Recognition (ICDAR 2017)
- Total-Text: a comprehensive dataset for scene text detection and recognition, 14th IAPR International Conference on Document Analysis and Recognition (ICDAR 2017)
- Detecting texts of arbitrary orientations in natural images, 2012 IEEE Conference on Computer Vision and Pattern Recognition
- TextBoxes++: a single-shot oriented scene text detector, IEEE Trans. Image Process.
- DeepText: a new approach for text proposal generation and text detection in natural images, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- Faster R-CNN: towards real-time object detection with region proposal networks, IEEE Trans. Pattern Anal. Mach. Intell.
- Detecting oriented text in natural images by linking segments, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
Ying Cai received the M.S. degree from Kunming University of Science and Technology, Kunming, China, in the area of Artificial intelligence and pattern recognition and the Ph.D. degree from Sichuan University, Chengdu, China, in the area of Electronics and information, in 2007 and 2017, respectively. Currently, she is an associate professor at the Southwest Minzu University and a postdoctoral researcher at Adelaide University, Australia. Her research interests include text detection and recognition, 2D and 3D face recognition.
Yuliang Liu received the B.S. degree in South China University of Technology. He received his Ph.D. degrees in Information and Communication Engineering with the Deep Learning and Vision Computing Lab (DLVC-Lab), South China University of Technology, under the supervision of Prof. L. Jin. He was a postdoc supervised by Prof. Chunhua Shen in University of Adelaide. His current research interests include computer vision, scene text detection and recognition.
Chunhua Shen is a Professor at the School of Computer Science, University of Adelaide, Australia.
Lianwen Jin (Member, IEEE) received the B.S. degree from the University of Science and Technology of China, Anhui, China, and the Ph.D. degree from the South China University of Technology, Guangzhou, China, in 1991 and 1996, respectively. He is currently a Professor with the College of Electronic and Information Engineering, South China University of Technology. His research interests include optical character recognition, computer vision, machine learning, and artificial intelligence. He has authored over 200 scientific articles. He is a member of the IEEE Signal Processing Society and the IEEE Computer Society. He received the New Century Excellent Talent Program of MOE Award and the Guangdong Pearl River Distinguished Professor Award in 2006.
Yidong Li is a Professor at the School of Computer and Information Technology, Beijing Jiaotong University, China.
Daji Ergu is a Professor at the Southwest Minzu University, Chengdu, China.