CE-text: A context-Aware and embedded text detector in natural scene images
Introduction
Due to the large variations in text and complex backgrounds, many deep learning based techniques have been proposed to improve the accuracy and robustness of text detection. However, these methods suffer greatly from slow optimization and detection speed, since each individual component must be trained and tuned separately. There is a trend toward directly predicting word bounding boxes with a single lightweight neural network. For example, [1] uses a single fully-convolutional network that copes with bounding boxes of extreme aspect ratios to perform fast text detection [2]. However, such methods generally regard text as one class of objects and do not exploit text-specific characteristics to achieve higher accuracy. Moreover, these methods have only been deployed and tested on systems with Nvidia Titan GPUs, due to their high demand for computational resources.
In this context, we build a context-aware and embedded text detector, named CE-Text, to detect text in natural scene images with high accuracy and efficiency. Fig. 1 shows the workflow of CE-Text: step (a) inputs a natural scene image; step (b) applies a hierarchical channel-wise attention scheme to generate a text context-aware feature map for higher detection accuracy; step (c) uses a lightweight convolutional neural network built on the feature map to predict multiple bounding boxes and the corresponding text existence probabilities; step (d) presents an embedded version of the text detector constructed by a frequency-based CNN compression method; and step (e) outputs text detection results with bounding boxes and associated probabilities. Our choice of techniques rests mainly on the following two considerations:
1) To overcome the factors that cause wrong predictions, it is essential to construct highly discriminative features for accurate detection. By stacking different layers, a CNN extracts image features through a hierarchical representation of visual abstractions. Features extracted from a CNN are therefore essentially channel-wise and multi-layer. However, not all features are equally important and informative for detecting text components. We therefore propose a hierarchical attention scheme to encode text context-aware information, which automatically focuses on text-related characteristics and discriminative feature channels for accurate text detection.
2) The main difficulty in moving deep neural networks to embedded systems lies in their high demand for computation and storage. For example, the original YOLO is over 230MB in size, containing 30 layers and parameters. Running YOLO with such a tremendous number of parameters consumes large storage and computational resources. Due to local pixel correlation in images, i.e., spatial locality, the parameters of channel filters in the proposed text detector and in other visual content analysis systems tend to be smooth. We therefore convert the parameter matrices into the frequency domain and keep only their low-frequency parts. The embedded version of CE-Text is thus suitable for embedded systems with limited storage and memory.
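The idea of keeping only the low-frequency part of a smooth parameter matrix can be illustrated with a small sketch. This is not the compression method of CE-Text itself, whose details are not given here: the choice of a 2D DCT as the frequency transform, the square low-frequency block, and the names `dct_matrix`, `compress` and `decompress` are all assumptions made for illustration.

```python
import numpy as np

def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II basis as an n x n matrix (row k is frequency k)."""
    k = np.arange(n)[:, None]   # frequency index
    i = np.arange(n)[None, :]   # sample index
    basis = np.cos(np.pi * (2 * i + 1) * k / (2 * n)) * np.sqrt(2.0 / n)
    basis[0, :] /= np.sqrt(2.0)  # rescale the DC row so the matrix is orthonormal
    return basis

def compress(weights: np.ndarray, keep: int) -> np.ndarray:
    """Transform to the frequency domain and keep the keep x keep low-frequency block.

    Assumption: a 2D DCT stands in for the (unspecified) frequency transform.
    """
    rows, cols = weights.shape
    coeffs = dct_matrix(rows) @ weights @ dct_matrix(cols).T
    return coeffs[:keep, :keep].copy()

def decompress(low: np.ndarray, shape: tuple) -> np.ndarray:
    """Zero-fill the discarded high frequencies and invert the transform."""
    rows, cols = shape
    coeffs = np.zeros(shape)
    coeffs[: low.shape[0], : low.shape[1]] = low
    return dct_matrix(rows).T @ coeffs @ dct_matrix(cols)

# Hypothetical smooth 16 x 16 "parameter matrix" (not real network weights).
x = np.linspace(0.0, 1.0, 16)
smooth = np.outer(np.sin(np.pi * x), np.cos(np.pi * x))

low = compress(smooth, keep=6)
restored = decompress(low, smooth.shape)
rel_err = np.linalg.norm(smooth - restored) / np.linalg.norm(smooth)
print(f"stored {low.size} of {smooth.size} coefficients, relative error {rel_err:.4f}")
```

Because the transform is orthonormal, the reconstruction error equals the energy of the discarded coefficients, so the smoother the matrix, the fewer coefficients need to be stored for a given error.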
The contributions of this paper are three-fold:
- We propose a novel deep and context-aware CNN structure for text detection, in which a task-specific hierarchical attention scheme is adopted to enhance the representational ability of features on the basis of multi-level and channel-wise context information.
- A hierarchical channel-wise attention scheme is carefully designed with channel-wise and multi-layer features, involving the inherent and unique context characteristics of text components for better detection results.
- CE-Text adopts a novel frequency-based deep compression method to build a lightweight and embedded text detector, which offers a highly representative feature map, fast computing speed and small storage size.
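To make the notion of channel-wise attention over multi-layer features more concrete, the following minimal sketch reweights the channels of feature maps taken from several layers. It assumes a squeeze-and-excitation style gate (global average pooling, a small bottleneck, a sigmoid), which is a stand-in and not necessarily the scheme used in CE-Text; the weights are random placeholders, and the names `channel_attention`, `make_params` and `hierarchical_attention` are hypothetical.

```python
import numpy as np

def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(feature_map: np.ndarray, w1: np.ndarray, w2: np.ndarray):
    """Reweight the channels of a (C, H, W) feature map.

    Assumption: squeeze-and-excitation style gating, used here only as an
    illustration of channel-wise attention.
    """
    squeezed = feature_map.mean(axis=(1, 2))   # (C,) global average per channel
    hidden = np.maximum(w1 @ squeezed, 0.0)    # ReLU bottleneck
    scores = sigmoid(w2 @ hidden)              # (C,) one weight per channel
    return feature_map * scores[:, None, None], scores

def make_params(channels: int, reduction: int, rng: np.random.Generator):
    """Random placeholder weights; in practice these would be learned."""
    hidden = max(channels // reduction, 1)
    w1 = rng.standard_normal((hidden, channels))
    w2 = rng.standard_normal((channels, hidden))
    return w1, w2

def hierarchical_attention(feature_maps, params):
    """Apply channel attention separately to the feature map of each layer."""
    return [channel_attention(f, w1, w2)[0] for f, (w1, w2) in zip(feature_maps, params)]

# Hypothetical feature maps from two layers of a CNN.
rng = np.random.default_rng(0)
feature_maps = [rng.standard_normal((8, 16, 16)), rng.standard_normal((16, 8, 8))]
params = [make_params(f.shape[0], reduction=4, rng=rng) for f in feature_maps]

attended = hierarchical_attention(feature_maps, params)
for before, after in zip(feature_maps, attended):
    print(before.shape, "->", after.shape)
```

Each layer keeps its spatial layout; only the relative strength of its channels changes, which is how less informative channels can be suppressed before detection.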
Section snippets
Text detection methods
We broadly categorize recent text detection methods as regression based or segmentation based [3]. Scene text detection has been greatly influenced by the idea of directly predicting locations without proposals or post-processing steps. EAST [4] proposes a simple yet powerful pipeline that yields fast and accurate text detection for words or text lines of arbitrary orientations with a single neural network. SegLink [5] first predicts text segments and then links text
The proposed method
In this section, we describe CE-Text in terms of its lightweight CNN structure design, hierarchical attention scheme, and embedded version design with deep compression.
Implementation details
All experiments are carried out on a server configured with an Intel Xeon E5-2630v4 CPU (10 cores at 2.2GHz), 64GB of memory and one NVIDIA Titan X card. We perform all embedded-system experiments on an ARM Cortex-A9 (4 cores @ 1.60GHz) embedded system with 2GB of RAM. CE-Text is trained with an Adam optimizer with an adaptive learning rate, set with an initial learning rate of 0.001 and a batch size of 12. Compared to batch gradient descent
Conclusion and future work
In this work, we propose a lightweight and context-aware deep convolutional neural network (CNN) for text detection. The method introduces a hierarchical text attention scheme, which captures context information by constructing multi-level channel attention modules. To fit the limited computational resources of embedded systems, we further transform CE-Text into a lighter version with a frequency-based deep CNN compression method. Experimental results on both workstations and embedded systems
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgment
This work was supported by the National Key R&D Program of China under Grant 2021YFB3900601, the Fundamental Research Funds for the Central Universities (32519113008, 32519113006, 31512111310, B220202074) and by the open project from the State Key Laboratory for Novel Software Technology, Nanjing University, under Grant No. KFKT2019B17.
References (24)
- et al. Rib segmentation algorithm for x-ray image based on unpaired sample augmentation and multi-scale network. Neur. Comput. Applic. (2021)
- et al. Edge computing enabled video segmentation for real-time traffic monitoring in internet of vehicles. Pattern Recognit. (2022)
- et al. Stimulus-driven and concept-driven analysis for image caption generation. Neurocomputing (2020)
- et al. Textboxes: a fast text detector with a single deep neural network. Proceedings of AAAI (2017)
- et al. Data dissemination for industry 4.0 applications in internet of vehicles based on short-term traffic prediction. ACM Transact. Internet Technol. (TOIT) (2021)
- et al. EAST: an efficient and accurate scene text detector. Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition (2017)
- et al. Detecting oriented text in natural images by linking segments. Proceedings of CVPR (2017)
- et al. Scene text detection via holistic, multi-channel prediction. CoRR (2016)
- et al. Self-organized text detection with minimal post-processing via border learning. Proceedings of 2017 IEEE International Conference on Computer Vision (2017)
- et al. A context-aware user-item representation learning for item recommendation. ACM Transact. Inform. Syst. (TOIS) (2019)
- Textsnake: A flexible representation for detecting text of arbitrary shapes. Proceedings of 2018 European Conference on Computer Vision
- Multi-oriented scene text detection via corner localization and region segmentation. Proceedings of 2018 IEEE Conference on Computer Vision and Pattern Recognition