CE-Text: A Context-Aware and Embedded Text Detector in Natural Scene Images

https://doi.org/10.1016/j.patrec.2022.05.004

Highlights

  • A novel deep and context-aware CNN structure for accurate and fast text detection

  • A hierarchical channel-wise attention scheme built on channel-wise and multi-layer features

  • A frequency-based deep compression method that yields a lightweight text detector

Abstract

With the significant power of deep learning architectures, researchers have made much progress on the effectiveness and efficiency of text detection in the past few years. However, because the unique characteristics of text components are not taken into account, directly applying deep learning models to the text detection task tends to yield low accuracy, and in particular false positive detections. To ease this problem, we propose a lightweight and context-aware deep convolutional neural network (CNN) named CE-Text, which encodes multi-level channel attention information to construct a discriminative feature map for accurate and efficient text detection. To fit the low computation resources of embedded systems, we further transform CE-Text into a lighter version with a frequency-based deep CNN compression method, which extends the applicable scenarios of CE-Text to a variety of embedded systems. Experiments on several popular datasets show that CE-Text not only achieves accurate text detection in scene images, but also runs fast on embedded systems.

Introduction

Due to the large variations in text and complex backgrounds, many deep learning based techniques have been proposed to improve the accuracy and robustness of text detection. However, these methods suffer from slow optimization and detection speed, since each individual component must be trained and its parameters tuned separately. There is a trend toward directly predicting word bounding boxes with a single, lightweight neural network. For example, [1] uses a single fully convolutional network that copes with bounding boxes of extreme aspect ratios to perform fast text detection [2]. However, such methods generally regard text as just another class of objects, without exploiting unique text characteristics for higher accuracy. Moreover, these methods are deployed and tested only on systems with NVIDIA Titan GPUs, owing to their high demand for computation resources.

In this context, we build a context-aware and embedded text detector named CE-Text to detect text in natural scene images with high accuracy and efficiency. Fig. 1 shows the workflow of CE-Text: step (a) inputs a natural scene image; step (b) applies a hierarchical channel-wise attention scheme to generate a text context-aware feature map for higher detection accuracy; step (c) runs a lightweight convolutional neural network built on this feature map to predict multiple bounding boxes and their text existence probabilities; step (d) constructs an embedded version of the detector with a frequency-based CNN compression method; and step (e) outputs text detection results as bounding boxes with associated probabilities. Our design rests on the following two considerations:

1) To suppress the factors that cause wrong predictions, it is essential to construct highly discriminative features for accurate detection. By stacking different layers, a CNN extracts image features through a hierarchical representation of visual abstractions. Features extracted from a CNN are therefore inherently channel-wise and multi-layer. However, not all features are equally important and informative for detecting text components. We therefore propose a hierarchical attention scheme to encode text context-aware information, which automatically focuses on text-related characteristics and discriminative feature channels for accurate text detection.
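The channel reweighting idea behind such a scheme can be illustrated with a minimal squeeze-and-excite-style sketch in NumPy. The projection matrices `w1` and `w2` are hypothetical learned parameters (with a reduction ratio of 2), not taken from the paper; the sketch only shows how per-channel weights rescale a feature map.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat, w1, w2):
    """Reweight channels of a (C, H, W) feature map by learned importance.

    feat: (C, H, W) feature map; w1 (C, C//r) and w2 (C//r, C) are
    hypothetical learned projections with reduction ratio r.
    """
    # Squeeze: global average pooling collapses each channel to one scalar.
    squeezed = feat.mean(axis=(1, 2))                       # (C,)
    # Excite: a two-layer bottleneck yields per-channel weights in (0, 1).
    weights = sigmoid(np.maximum(squeezed @ w1, 0) @ w2)    # (C,)
    # Rescale: informative channels are amplified, others suppressed.
    return feat * weights[:, None, None]

# Toy example: 8 channels, reduction ratio 2, random parameters.
rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 4, 4))
w1 = rng.standard_normal((8, 4))
w2 = rng.standard_normal((4, 8))
out = channel_attention(feat, w1, w2)
print(out.shape)
```

Because the weights lie in (0, 1), attention never increases a channel's magnitude; it only redistributes emphasis across channels.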

2) The main difficulty in moving deep neural networks to embedded systems lies in their high demand for computation and storage. For example, the original YOLO is over 230 MB in size, containing 30 layers and 6.74 × 10⁸ parameters. Running YOLO with such a tremendous number of parameters consumes large storage and computational resources. Due to the local pixel correlation in images, i.e., spatial locality, the parameters of channel filters in text detectors and other visual content analysis systems tend to be smooth. We therefore convert the parameter matrices into the frequency domain and retain only their low-frequency parts. The embedded version of CE-Text is thus suitable for embedded systems with little storage and memory.

The contributions of this paper are three-fold:

  • We propose a novel deep and context-aware CNN structure for text detection, in which a task-specific hierarchical attention scheme enhances feature representation on the basis of multi-level and channel-wise context information.

  • A hierarchical channel-wise attention scheme is carefully designed around channel-wise and multi-layer features, incorporating inherent and unique context characteristics of text components for better detection results.

  • CE-Text adopts a novel frequency-based deep compression method to build a lightweight, embedded text detector, which offers a highly representative feature map, fast computation and a small storage footprint.

Section snippets

Text detection methods

We generally categorize recent text detection methods as regression based or segmentation based [3]. Scene text detection has been greatly influenced by the idea of directly predicting locations without proposals or post-processing steps. EAST [4] proposes a simple yet powerful pipeline that yields fast and accurate detection of words or text lines of arbitrary orientations with a single neural network. SegLink [5] first predicts text segments and then links text

The proposed method

In this section, we describe CE-Text in terms of its lightweight CNN structure, its hierarchical attention scheme, and the design of its embedded version via deep compression.

Implementation details

Experiments are all carried out on a server configured with an Intel Xeon E5-2630 v4 CPU (10 cores @ 2.2 GHz), 64 GB of memory and one NVIDIA Titan X GPU. All embedded-system experiments are performed on an ARM Cortex-A9 (4 cores @ 1.60 GHz) embedded system with 2 GB of RAM. CE-Text is trained with the Adam optimizer with an adaptive learning rate, using an initial learning rate of 0.001 and a batch size of 12. Compared to batch gradient descent
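For reference, the Adam update rule used here (with the paper's initial learning rate of 0.001) can be sketched as follows. This is a standard re-implementation of Adam in NumPy, not the authors' training code; the quadratic toy objective is only for illustration.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update. m, v: running first/second moment estimates;
    t: 1-based step index. Returns updated (param, m, v)."""
    m = b1 * m + (1 - b1) * grad               # momentum on the gradient
    v = b2 * v + (1 - b2) * grad ** 2          # momentum on its square
    m_hat = m / (1 - b1 ** t)                  # bias correction: early
    v_hat = v / (1 - b2 ** t)                  # moments are underestimated
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

# Toy run: minimize f(w) = w^2 from w = 1. The per-parameter adaptive
# step (about lr in magnitude) drives w toward the optimum at 0.
w = np.array([1.0])
m, v = np.zeros(1), np.zeros(1)
for t in range(1, 2001):
    w, m, v = adam_step(w, 2 * w, m, v, t)
print(abs(float(w[0])))
```

Unlike batch gradient descent, the effective step size here adapts per parameter via the second-moment estimate, which is what makes the adaptive learning rate schedule mentioned above work with a fixed initial value.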

Conclusion and future work

In this work, we propose a lightweight and context-aware deep convolutional neural network (CNN) for text detection. The method introduces a hierarchical text attention scheme, which captures context information by constructing multi-level channel attention modules. To fit the low computation resources of embedded systems, we further transform CE-Text into a lighter version with a frequency-based deep CNN compression method. Experimental results on both workstations and embedded systems

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This work was supported by National Key R&D Program of China under Grant 2021YFB3900601, the fundamental research funds for the central universities (32519113008,32519113006, 31512111310,B220202074) and by the open project from the State Key Laboratory for Novel Software Technology, Nanjing University, under Grant No. KFKT2019B17.

References (24)

  • S. Long et al., TextSnake: a flexible representation for detecting text of arbitrary shapes, in: Proceedings of the 2018 European Conference on Computer Vision, 2018.

  • P. Lyu et al., Multi-oriented scene text detection via corner localization and region segmentation, in: Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, 2018.