Arbitrarily shaped scene text detection with dynamic convolution

https://doi.org/10.1016/j.patcog.2022.108608

Highlights

  • According to the detailed characteristics of each text instance, we dynamically generate convolutional kernels from multiple features for different instances. Specific attributes, such as position, scale, and center, are embedded into the convolutional kernel so that the mask prediction task using the text-instance-aware kernel focuses on the pixels belonging to the corresponding instance. This design helps to improve the detection accuracy of adjacent text instances.

  • We generate the respective mask prediction head for each instance in parallel. These heads predict masks on the original feature map and retain the resolution details of the text instance. It is no longer necessary to crop the RoIs and force them to be the same size. Our architecture overcomes the problem that a set of fixed convolutional kernels cannot adapt to all resolutions, and at the same time prevents the loss of information caused by the multiple scales of the instances.

  • Because the text-instance-aware convolutional kernel increases the capacity of the model, we can achieve competitive results with a very compact prediction head. Therefore, multiple mask prediction heads can run concurrently without significant computational overhead.

  • To improve performance and accelerate the convergence of training, we design a text-shape-sensitive position embedding to explicitly provide location information to the mask prediction head.
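
One way such an explicit position embedding could be realized, given here as a minimal sketch rather than the paper's exact formulation, is to encode the coordinates of every pixel relative to an instance's center, normalized by the instance's scale, and feed these maps to the mask prediction head. The function name and the normalization scheme below are illustrative assumptions (PyTorch).

    import torch

    def relative_coord_map(h, w, center, scale):
        # Hypothetical sketch of an explicit position embedding: pixel coordinates
        # relative to one text instance's center, normalized by its scale.
        # `center` = (cx, cy) and `scale` = (sx, sy) would come from the detection branch.
        ys = torch.arange(h, dtype=torch.float32).view(h, 1).expand(h, w)
        xs = torch.arange(w, dtype=torch.float32).view(1, w).expand(h, w)
        rel_x = (xs - center[0]) / scale[0]
        rel_y = (ys - center[1]) / scale[1]
        return torch.stack([rel_x, rel_y], dim=0)  # (2, h, w), concatenated to the mask features

    # usage: embedding for an instance centred at (80, 40) with width 64 and height 16
    coords = relative_coord_map(160, 160, center=(80.0, 40.0), scale=(64.0, 16.0))
    print(coords.shape)  # torch.Size([2, 160, 160])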

Abstract

Arbitrarily shaped scene text detection has witnessed great development in recent years, and text detection using segmentation has been proven to be an effective approach. However, problems caused by the diverse attributes of text instances, such as shapes, scales, and presentation styles (dense or sparse), persist. In this paper, we propose a novel text detector, termed DText, which can effectively formulate an arbitrarily shaped scene text detection task based on dynamic convolution. Our method dynamically generates independent text-instance-aware convolutional parameters for each text instance from multiple features, thus overcoming some intractable limitations of arbitrary text detection, such as the splitting of similar adjacent text, which poses challenges to methods based on fixed, instance-shared convolutional parameters. Unlike standard segmentation methods relying on region-of-interest bounding boxes, DText focuses on enhancing the flexibility of the network to retain details of instances at diverse resolutions while effectively improving prediction accuracy. Moreover, we propose encoding shape and position information according to the characteristics of the text instance, termed text-shape sensitive position embedding, which provides explicit shape and position information to the generator of the dynamic convolution parameters. Experiments on five benchmarks (Total-Text, SCUT-CTW1500, MSRA-TD500, ICDAR2015, and MLT) show that our method achieves superior detection performance.

Introduction

Scene text detection has attracted increasing attention in the past decades. The rich semantic information provided by scene text has facilitated applications such as blind navigation, automatic pilots, and product identification. Owing to the recent development of general object detection [1], [2], various methods have been proposed to detect multi-oriented scene text. However, most methods rely on the regression of rectangles or quadrangles to obtain text bounding boxes, which struggle to tightly enclose arbitrarily-shaped text. As a result, much irrelevant image content, including the background and other instances, is included in the bounding box. The detection of arbitrarily-shaped text requires algorithms to predict the exact position and shape of each text instance in the target image while including as little background as possible. With the widespread application of fully convolutional neural networks (FCNs), segmentation-based text detection methods implement text detection through the prediction of pixel-level text instance masks, thereby promoting the detection of arbitrarily-shaped text [3], [4]. However, FCN-based methods struggle to produce accurate predictions for neighboring instances with similar image appearances, and text instances in scene images are often quite close to each other at various scales and in various numbers. Thus, complicated post-processing steps are inevitable in many segmentation-based text detection methods, which split and merge the instance information from the mask region to reconstruct the words or text lines. Therefore, effectively splitting adjacent text and retaining detailed information across the diverse scales of different instances without post-processing is still intractable and worth further exploration.

Here, in contrast to previous methods, we propose a novel arbitrarily-shaped text detector, named DText, which formulates an arbitrarily-shaped text detection problem based on dynamic convolution. Our motivations are based on the following observations:

  • Many existing regression-based text detection methods are restricted by inflexible bounding boxes to predict arbitrarily-shaped text.

  • Using fixed, instance-shared convolutional kernels makes FCN-based text segmentation methods struggle to predict individual masks for adjacent instances with similar appearances; thus, complicated post-processing is required to distinguish them.

  • Previous segmentation-based methods usually crop the region of interest (RoI) of each instance and then use a fixed head to predict the mask for multi-scale RoIs. Unifying the resolution of the RoIs is necessary because of the fixed structure of the head. Clearly, it is difficult to determine the optimal resolution for instances with different scales.

Therefore, inspired by [5], [6], we propose to dynamically generate convolutional kernels according to the particularities of each text instance. Compared to [6], which aggregates a set of fixed convolutional kernels with attention weights that vary at the inference stage, our method directly generates convolutional kernels, improving the representation capability and variability of the kernels, and maintains instance-specific kernels at the inference stage. Compared to CondInst [7], our method is more sensitive to the various shapes of text instances. The main contributions of this study are summarized as follows.

1) We dynamically generated convolutional kernels from multiple features for different text instances according to their detailed characteristics. Specific attributes, such as position, scale, and center, were embedded into the convolutional kernel so that the mask prediction task using the text-instance-aware kernel focuses on the pixels belonging to the corresponding instance.

2) We generated the respective mask prediction head for each instance in parallel. These heads predict masks on the original feature map and retain the resolution details of the text instance. Thus, it is not necessary to crop the RoIs and force them to be of the same size. Our architecture overcomes the problem arising when a set of fixed convolutional kernels cannot adapt to all resolutions, and simultaneously prevents the loss of information caused by the multiple scales of the instances. This design also helps to improve the detection accuracy of adjacent text instances, as it predicts the text mask using different kernels in different channels.

3) Because the text-instance-aware convolutional kernel increases the capacity of the model, we could achieve competitive results with a very compact prediction head. Therefore, multiple mask prediction heads can be predicted concurrently without significant computational overhead.

4) To improve the performance and accelerate the convergence of training, we designed a text-shape-sensitive position embedding to provide explicit location information to the mask prediction head.
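
To make contributions 1)–3) concrete, the following is a minimal PyTorch sketch, written under our own assumptions rather than taken from the authors' released implementation, of instance-aware dynamic convolution: a small controller maps each instance's feature vector to the weights and bias of a compact mask head, and all per-instance heads are then applied to the shared full-resolution feature map in one batched convolution, without RoI cropping or resizing. The module name, the single 1x1-convolution head, and the channel sizes are assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class KernelController(nn.Module):
        # Illustrative sketch: generates per-instance conv parameters from instance features.
        def __init__(self, feat_dim=256, head_channels=8):
            super().__init__()
            self.head_channels = head_channels
            # weights + bias of a single 1x1 conv mapping head_channels -> 1 mask channel
            self.fc = nn.Linear(feat_dim, head_channels + 1)

        def forward(self, inst_feats, mask_feats):
            # inst_feats: (N, feat_dim), one vector per detected text instance
            # mask_feats: (C, H, W), shared feature map with C == head_channels
            n = inst_feats.size(0)
            params = self.fc(inst_feats)                # (N, head_channels + 1)
            w = params[:, :self.head_channels]          # per-instance kernel weights
            b = params[:, self.head_channels]           # per-instance bias
            w = w.view(n, self.head_channels, 1, 1)     # N output maps, one per instance
            x = mask_feats.unsqueeze(0)                 # (1, C, H, W)
            masks = F.conv2d(x, w, bias=b)              # (1, N, H, W): one mask per instance
            return masks.squeeze(0).sigmoid()

    # usage: 3 instances, a 256-d descriptor each, shared 8-channel feature map
    controller = KernelController()
    masks = controller(torch.randn(3, 256), torch.randn(8, 160, 160))
    print(masks.shape)  # torch.Size([3, 160, 160])

A deeper per-instance head can be generated in the same way by reserving more slots in the parameter vector; the point of the sketch is that every instance receives its own kernels while all instances share one full-resolution feature map.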

DText demonstrates that dynamic convolution can embed more high-level graphical semantic information of the text instance into the dynamic kernel to improve the performance of the model. An example of DText prediction is shown in Fig. 1. Experiments conducted on five challenging benchmark datasets, including ICDAR 2015 [8], MLT [9], CTW1500 [10], Total-Text [11], and MSRA-TD500 [12], demonstrated the effectiveness of the proposed DText.

Section snippets

Related work

Traditional scene text detection methods mainly extract low- or mid-level handcrafted image features and require complex pre-processing and post-processing steps. Examples include the connected components analysis (CCA) based methods, sliding window (SW) based methods, and handcrafted image feature-based methods. With the emergence and development of deep learning, we have witnessed substantial advancements in arbitrarily shaped scene text detection in terms of both performance and robustness.

Methodology

DText integrates the potential of two basic prediction methods (regression and segmentation), which in practice shows more robustness for arbitrary shapes. Furthermore, we take the high-level features of multiple tasks in traditional object detection as input to dynamically generate the required convolutional parameters for the text mask prediction task. As a result, the convolutional parameters can be dynamically adjusted in the inference stage. In addition, these dynamically generated
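
Although this snippet is truncated, the flow it describes, taking high-level features from several detection tasks and mapping them to the convolutional parameters of the mask prediction head, can be sketched roughly as below. The branch names, the concatenation-based fusion, and the parameter count are assumptions rather than the paper's exact design (PyTorch).

    import torch
    import torch.nn as nn

    class DynamicParamGenerator(nn.Module):
        # Rough sketch: fuse per-instance features from several detection branches
        # into one descriptor, then map it to the flattened mask-head parameters.
        def __init__(self, dims=(256, 256, 64), hidden=256, num_params=169):
            super().__init__()
            self.fuse = nn.Linear(sum(dims), hidden)
            self.to_params = nn.Linear(hidden, num_params)  # weights + biases of the dynamic head

        def forward(self, cls_feat, reg_feat, pos_embed):
            # each argument: (N, dim) per-instance features (names are hypothetical)
            fused = torch.relu(self.fuse(torch.cat([cls_feat, reg_feat, pos_embed], dim=1)))
            return self.to_params(fused)                    # (N, num_params)

    # usage: parameters for 2 instances
    generator = DynamicParamGenerator()
    params = generator(torch.randn(2, 256), torch.randn(2, 256), torch.randn(2, 64))
    print(params.shape)  # torch.Size([2, 169])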

Experiments

To evaluate the effectiveness of our method, we first conduct ablation experiments on CTW1500, and then compare our method with previous state-of-the-art methods on both multi-oriented and arbitrarily shaped scene text benchmarks. Our method is implemented in PyTorch 1.8.1.

Conclusion

In this study, we successfully adopt dynamic convolution for scene text detection. This first attempt indicates the promising potential of the proposed DText, i.e., it can improve the detection accuracy of arbitrarily shaped text by maintaining the integrity of the text instance and providing a unique convolution kernel for each instance. In addition, it can alleviate the adhesion problem of dense text. Exhaustive ablation studies demonstrate the effectiveness of the residual connection in dynamic

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work is supported by the China Scholarship Council (Grant No. 201908510079), Fundamental Research Funds for the Central Universities (Grant No. ZYN2022029), Guangdong Provincial Natural Science Foundation (Grant No. 2017A030312006), National Natural Science Foundation of China (Grant No. 61936003, 72174172, 71774134).


References (40)

  • Y. Chen, et al., Dynamic convolution: attention over convolution kernels, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.

  • Z. Tian, et al., Conditional convolutions for instance segmentation, in: Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I, 2020.

  • D. Karatzas, et al., ICDAR 2015 competition on robust reading, in: 2015 13th International Conference on Document Analysis and Recognition (ICDAR), 2015.

  • N. Nayef, et al., ICDAR2017 robust reading challenge on multi-lingual scene text detection and script identification – RRC-MLT, in: 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), 2017.

  • C.K. Ch’ng, et al., Total-Text: a comprehensive dataset for scene text detection and recognition, in: 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), 2017.

  • C. Yao, et al., Detecting texts of arbitrary orientations in natural images, in: 2012 IEEE Conference on Computer Vision and Pattern Recognition, 2012.

  • M. Liao, et al., TextBoxes++: a single-shot oriented scene text detector, IEEE Trans. Image Process., 2018.

  • Z. Zhong, et al., DeepText: a new approach for text proposal generation and text detection in natural images, in: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017.

  • S. Ren, et al., Faster R-CNN: towards real-time object detection with region proposal networks, IEEE Trans. Pattern Anal. Mach. Intell., 2016.

  • B. Shi, et al., Detecting oriented text in natural images by linking segments, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.

    Ying Cai received the M.S. degree from Kunming University of Science and Technology, Kunming, China, in the area of Artificial intelligence and pattern recognition and the Ph.D. degree from Sichuan University, Chengdu, China, in the area of Electronics and information, in 2007 and 2017, respectively. Currently, she is an associate professor at the Southwest Minzu University and a postdoctoral researcher at Adelaide University, Australia. Her research interests include text detection and recognition, 2D and 3D face recognition.

    Yuliang Liu received the B.S. degree in South China University of Technology. He received his Ph.D. degrees in Information and Communication Engineering with the Deep Learning and Vision Computing Lab (DLVC-Lab), South China University of Technology, under the supervision of Prof. L. Jin. He was a postdoc supervised by Prof. Chunhua Shen in University of Adelaide. His current research interests include computer vision, scene text detection and recognition.

    Chunhua Shen is a Professor at the School of Computer Science, University of Adelaide, Australia.

    Lianwen Jin (Member, IEEE) received the B.S. degree from the University of Science and Technology of China, Anhui, China, and the Ph.D. degree from the South China University of Technology, Guangzhou, China, in 1991 and 1996, respectively. He is currently a Professor with the College of Electronic and Information Engineering, South China University of Technology. His research interests include optical character recognition, computer vision, machine learning, and artificial intelligence. He has authored over 200 scientific articles. He is a member of the IEEE Signal Processing Society and the IEEE Computer Society. He received the New Century Excellent Talent Program of MOE Award and the Guangdong Pearl River Distinguished Professor Award in 2006.

    Yidong Li is a Professor at the School of Computer and Information Technology, Beijing Jiaotong University, China.

    Daji Ergu is a Professor at the Southwest Minzu University, Chengdu, China.
