CE-Text: A Context-Aware and Embedded Text Detector in Natural Scene Images

https://doi.org/10.1016/j.patrec.2022.05.004

Highlights

  • A novel deep and context-aware CNN structure for accurate and fast text detection

  • A hierarchical channel-wise attention scheme built on channel-wise and multi-layer features

  • A frequency-based deep compression method that yields a lightweight text detector

Abstract

With the significant power of deep learning architectures, researchers have made much progress on the effectiveness and efficiency of text detection in the past few years. However, because the unique characteristics of text components are not taken into account, directly applying deep learning models to the text detection task tends to yield low accuracy, and in particular false positive detections. To ease this problem, we propose a lightweight and context-aware deep convolutional neural network (CNN) named CE-Text, which encodes multi-level channel attention information to construct a discriminative feature map for accurate and efficient text detection. To fit the low computation resources of embedded systems, we further transform CE-Text into a lighter version with a frequency-based deep CNN compression method, which extends the applicable scenarios of CE-Text to a variety of embedded systems. Experiments on several popular datasets show that CE-Text not only achieves accurate text detection in scene images, but also runs fast on embedded systems.

Introduction

Due to the large variations in text and complex backgrounds, many deep learning based techniques have been proposed to improve the accuracy and robustness of text detection. However, these methods suffer from slow optimization and detection speed, since each individual component must be trained and its parameters tuned separately. There is a trend toward directly predicting word bounding boxes with a single, lightweight neural network. For example, [1] uses a single fully convolutional network that copes with bounding boxes of extreme aspect ratios to perform fast text detection [2]. However, such methods generally regard text as just another class of objects, without exploiting unique text characteristics for higher accuracy. Moreover, these methods are deployed and tested only on systems with NVIDIA Titan GPUs, owing to their high demand for computation resources.

In this context, we build a context-aware and embedded text detector named CE-Text to detect text in natural scene images with high accuracy and efficiency. Fig. 1 shows the workflow of CE-Text: step (a) inputs a natural scene image; step (b) applies a hierarchical channel-wise attention scheme to generate a text context-aware feature map for higher detection accuracy; step (c) runs a lightweight convolutional neural network built on this feature map to predict multiple bounding boxes and their text existence probabilities; step (d) constructs an embedded version of the detector with a frequency-based CNN compression method; and step (e) outputs text detection results as bounding boxes with associated probabilities. Our design rests on the following two considerations:

1) To suppress the factors that cause wrong predictions, it is essential to construct highly discriminative features for accurate detection. By stacking different layers, a CNN extracts image features through a hierarchical representation of visual abstractions. Features extracted from a CNN are therefore inherently channel-wise and multi-layer. However, not all features are equally important and informative for detecting text components. We therefore propose a hierarchical attention scheme to encode text context-aware information, which automatically focuses on text-related characteristics and discriminative feature channels for accurate text detection.
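The channel reweighting idea behind such a scheme can be illustrated with a minimal squeeze-and-excite-style sketch in NumPy. The projection matrices `w1` and `w2` are hypothetical learned parameters (with a reduction ratio of 2), not taken from the paper; the sketch only shows how per-channel weights rescale a feature map.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat, w1, w2):
    """Reweight channels of a (C, H, W) feature map by learned importance.

    feat: (C, H, W) feature map; w1 (C, C//r) and w2 (C//r, C) are
    hypothetical learned projections with reduction ratio r.
    """
    # Squeeze: global average pooling collapses each channel to one scalar.
    squeezed = feat.mean(axis=(1, 2))                       # (C,)
    # Excite: a two-layer bottleneck yields per-channel weights in (0, 1).
    weights = sigmoid(np.maximum(squeezed @ w1, 0) @ w2)    # (C,)
    # Rescale: informative channels are amplified, others suppressed.
    return feat * weights[:, None, None]

# Toy example: 8 channels, reduction ratio 2, random parameters.
rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 4, 4))
w1 = rng.standard_normal((8, 4))
w2 = rng.standard_normal((4, 8))
out = channel_attention(feat, w1, w2)
print(out.shape)
```

Because the weights lie in (0, 1), attention never increases a channel's magnitude; it only redistributes emphasis across channels.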

2) The main difficulty in moving deep neural networks to embedded systems lies in their high demand for computation and storage. For example, the original YOLO is over 230 MB in size, containing 30 layers and 6.74 × 10⁸ parameters. Running YOLO with such a tremendous number of parameters consumes large storage and computational resources. Due to the local pixel correlation in images, i.e., spatial locality, the parameters of channel filters in text detectors and other visual content analysis systems tend to be smooth. We therefore convert the parameter matrices into the frequency domain and retain only their low-frequency parts. The embedded version of CE-Text is thus suitable for embedded systems with little storage and memory.

The contributions of this paper are three-fold:

  • We propose a novel deep and context-aware CNN structure for text detection, in which a task-specific hierarchical attention scheme enhances feature representation on the basis of multi-level and channel-wise context information.

  • A hierarchical channel-wise attention scheme is carefully designed around channel-wise and multi-layer features, incorporating inherent and unique context characteristics of text components for better detection results.

  • CE-Text adopts a novel frequency-based deep compression method to build a lightweight, embedded text detector, which offers a highly representative feature map, fast computation and a small storage footprint.

Section snippets

Text detection methods

We generally categorize recent text detection methods as regression based or segmentation based [3]. Scene text detection has been greatly influenced by the idea of directly predicting locations without proposals or post-processing steps. EAST [4] proposes a simple yet powerful pipeline that yields fast and accurate detection of words or text lines of arbitrary orientations with a single neural network. SegLink [5] first predicts text segments and then links text

The proposed method

In this section, we describe CE-Text in terms of its lightweight CNN structure, its hierarchical attention scheme, and the design of its embedded version via deep compression.

Implementation details

Experiments are all carried out on a server configured with an Intel Xeon E5-2630 v4 CPU (10 cores @ 2.2 GHz), 64 GB of memory and one NVIDIA Titan X GPU. All embedded-system experiments are performed on an ARM Cortex-A9 (4 cores @ 1.60 GHz) embedded system with 2 GB of RAM. CE-Text is trained with the Adam optimizer with an adaptive learning rate, using an initial learning rate of 0.001 and a batch size of 12. Compared to batch gradient descent
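For reference, the Adam update rule used here (with the paper's initial learning rate of 0.001) can be sketched as follows. This is a standard re-implementation of Adam in NumPy, not the authors' training code; the quadratic toy objective is only for illustration.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update. m, v: running first/second moment estimates;
    t: 1-based step index. Returns updated (param, m, v)."""
    m = b1 * m + (1 - b1) * grad               # momentum on the gradient
    v = b2 * v + (1 - b2) * grad ** 2          # momentum on its square
    m_hat = m / (1 - b1 ** t)                  # bias correction: early
    v_hat = v / (1 - b2 ** t)                  # moments are underestimated
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

# Toy run: minimize f(w) = w^2 from w = 1. The per-parameter adaptive
# step (about lr in magnitude) drives w toward the optimum at 0.
w = np.array([1.0])
m, v = np.zeros(1), np.zeros(1)
for t in range(1, 2001):
    w, m, v = adam_step(w, 2 * w, m, v, t)
print(abs(float(w[0])))
```

Unlike batch gradient descent, the effective step size here adapts per parameter via the second-moment estimate, which is what makes the adaptive learning rate schedule mentioned above work with a fixed initial value.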

Conclusion and future work

In this work, we propose a lightweight and context-aware deep convolutional neural network (CNN) for text detection. The method introduces a hierarchical text attention scheme, which captures context information by constructing multi-level channel attention modules. To fit the low computation resources of embedded systems, we further transform CE-Text into a lighter version with a frequency-based deep CNN compression method. Experimental results on both workstations and embedded systems

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This work was supported by National Key R&D Program of China under Grant 2021YFB3900601, the fundamental research funds for the central universities (32519113008,32519113006, 31512111310,B220202074) and by the open project from the State Key Laboratory for Novel Software Technology, Nanjing University, under Grant No. KFKT2019B17.

References (24)

  • S. Long et al., TextSnake: a flexible representation for detecting text of arbitrary shapes, in: Proceedings of the 2018 European Conference on Computer Vision, 2018.

  • P. Lyu et al., Multi-oriented scene text detection via corner localization and region segmentation, in: Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, 2018.