Neurocomputing

Volume 463, 6 November 2021, Pages 101-108

Unattached irregular scene text rectification with refined objective

https://doi.org/10.1016/j.neucom.2021.08.047

Abstract

Recently, text recognition has reached a high level of performance with deep-learning-based methods. These techniques are widely applied across fields, and most current research aims to build effective approaches for handling irregular text in scene images. In this work, we propose a GAN-based framework to rectify scene text affected by rotation, curving, or other distortions. Unlike previous rectification modules that rely on recognition networks, our model can be utilized either as an independent model or as an extra component, so annotations of the text content are not required to train it. We also adopt a refined training objective with a proposed sample loss, which effectively controls the pixels in the output images that are supposed to be sampled from the input ones. Experiments on public benchmarks demonstrate the effectiveness of our method. The code will be publicly available on GitHub soon.

Introduction

Text in scene images usually conveys critical information for image analysis, so text recognition is of great significance in computer vision. Early methods typically segmented the characters of a word first and then recognized each character. Over recent years, however, end-to-end techniques based on deep learning have been extensively applied, and CNN-based methods for text recognition have sprung up [1], [2], [3], [4]. These works have achieved excellent results and mostly share a framework with three modules: convolutional layers, recurrent layers, and a transcription layer. This scheme eliminates the need for character segmentation, avoiding error-accumulation problems. However, it rests on the premise that the text is arranged roughly from left to right without severe distortion. If the text is rotated, curved, or distorted by a shift in perspective, this procedure may fail to recognize the words.

To deal with irregular text in scene images, researchers have pursued two mainstream ideas: improving the network modules, or rectifying the image before recognition. Approaches of the first kind focus on network architecture, aiming to improve the shape of the convolutional receptive fields or the strategy of the attention mechanism [5]. Approaches of the second kind seek effective rectification modules placed in front of the recognition network [2], [3]. Baek et al. [4] surveyed most text recognition models as combinations of individual modules and found that spatial transformer networks (STN) [6] are widely utilized to process irregular text before recognition; the best combination they proposed also relies on an STN [6] as the pre-processing step. Although all of these methods achieve excellent performance, STN-based rectification appears more adaptable, since it can be plugged into any state-of-the-art recognition network in the future.

Although rectifying text images before recognition has clear advantages, rectification modules robust enough for varied applications remain scarce. Moreover, most rectification models must be trained jointly with the recognition network. While this has little influence on recognition accuracy, it restricts application scenarios, because the joint training requires annotations of the text content. In some circumstances rectification is applied for better display rather than for recognition, so these redundant annotations waste annotation effort. To mitigate this issue, in this paper we propose an unattached scene text rectification model for irregular text images. It is based on Generative Adversarial Networks (GAN) [7] and can be trained independently. The generator has an encoder-decoder framework, with ResNeXt-50 [8] as the encoder; the decoder consists of transposed convolutions with Residual Blocks (ResBlocks) [9] and skip connections [10]. We also employ a refined training objective, including a newly proposed sample loss that is highly effective in governing the pixels in the output images that are supposed to be sampled from the input one. Experiments on public datasets demonstrate the effectiveness of the proposed method. In summary, this work makes three main contributions:

  • We propose an unattached model for scene text rectification. It can be trained without recognition modules or annotations of the text content, and can be utilized either as an independent model or as an extra module in front of recognition models.

  • A refined training objective is presented to build a robust and effective model. Four loss functions are involved, and the proposed sample loss shows competitive effectiveness in this rectification task.

  • The code and dataset will be publicly available soon. To facilitate further research, we will release the related materials in the future.
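As a rough sketch of the generator described in the introduction (an encoder-decoder with transposed convolutions, ResBlocks, and skip connections), the structure below illustrates the idea in PyTorch. It is an assumption-laden simplification: the ResNeXt-50 encoder is replaced by a small convolutional stack for brevity, and all channel counts and layer depths are illustrative, not the paper's.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Simplified residual block (actual block design is not specified here)."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1))

    def forward(self, x):
        return torch.relu(x + self.body(x))

class Generator(nn.Module):
    """Encoder-decoder sketch; a small conv stack stands in for ResNeXt-50."""
    def __init__(self, ch=32):
        super().__init__()
        # Encoder: two stride-2 downsampling stages
        self.enc1 = nn.Sequential(nn.Conv2d(1, ch, 4, 2, 1), nn.ReLU(True))
        self.enc2 = nn.Sequential(nn.Conv2d(ch, ch * 2, 4, 2, 1), nn.ReLU(True))
        # Decoder: ResBlock + transposed convolutions, with a skip connection
        self.res = ResBlock(ch * 2)
        self.dec2 = nn.Sequential(
            nn.ConvTranspose2d(ch * 2, ch, 4, 2, 1), nn.ReLU(True))
        self.dec1 = nn.ConvTranspose2d(ch * 2, 1, 4, 2, 1)  # ch*2: skip concat

    def forward(self, x):
        e1 = self.enc1(x)                 # 1/2 resolution
        e2 = self.enc2(e1)                # 1/4 resolution
        d2 = self.dec2(self.res(e2))      # back to 1/2 resolution
        out = self.dec1(torch.cat([d2, e1], dim=1))  # skip connection from e1
        return torch.sigmoid(out)

g = Generator()
x = torch.randn(2, 1, 64, 200)  # grayscale 200x64 input, NCHW layout
y = g(x)                        # rectified image, same spatial size
```

The skip connection concatenates an encoder feature map into the decoder, following the U-Net-style connections [10] the paper cites.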

The rest of this paper is organized as follows. Section 2 introduces related work. Section 3 describes the proposed method in detail, including the network architecture and the training objective. Section 4 presents and analyzes the experimental results. Finally, Section 5 discusses our method and concludes the paper.

Section snippets

Text recognition

Scene text recognition has long been an attractive research field. From methods requiring character segmentation to CNN-based end-to-end frameworks, related methods have developed over a long period and achieve superb performance. Researchers have provided excellent frameworks [1], [3], [11], but irregular text remains troublesome. Under rotation, curving, or other distortions, the characters may not be arranged from left to right as usual, and the features could

Methodology

In this section, we describe the architecture of our model. The network is based on GAN [7], with a generator built on a coarse-to-fine architecture. The training objective is carefully refined to achieve better performance, and we propose a sample loss to help the network reach satisfactory effectiveness.
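The exact formula of the sample loss is not given in this excerpt; it is described only as controlling output pixels that are supposed to be sampled from the input. One plausible reading, sketched below purely as an illustration, is an L1 penalty restricted by a binary mask to those "sampled" pixels. The function name, mask, and normalization are all assumptions, not the paper's definition.

```python
import torch

def sample_loss(output, input_img, mask):
    """Hypothetical sketch of a sample loss: an L1 penalty applied only
    at pixels (mask == 1) expected to be copied from the input image."""
    diff = (output - input_img).abs() * mask
    # Average over the masked pixels; clamp avoids division by zero
    return diff.sum() / mask.sum().clamp(min=1)

# Toy example: only the top-left pixel is marked as a "sampled" pixel.
out = torch.ones(1, 1, 2, 2)
inp = torch.zeros(1, 1, 2, 2)
mask = torch.tensor([[[[1.0, 0.0], [0.0, 0.0]]]])
loss = sample_loss(out, inp, mask)  # mean |1 - 0| over one masked pixel
```

In a full objective this term would be combined with the adversarial loss and the other loss functions the paper mentions, with weights the excerpt does not specify.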

Training details

The network takes a grayscale image of size 200×64 as input. Because the proposed method is designed to keep the text content unchanged, paired irregular and regular text images are required for effective training. The optimizer of the network is Adam [31] with a learning rate of 1e-2, and the optimizer of the discriminator is Stochastic Gradient Descent (SGD) with a learning rate of 1e-4. The batch size is set to 128. The network is trained for 100 epochs, which consumes
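The stated training hyper-parameters can be wired up as follows. This is a configuration sketch only: the two conv layers are placeholders standing in for the actual generator and discriminator, which are not fully specified in this excerpt.

```python
import torch

# Placeholder modules standing in for the real generator/discriminator.
generator = torch.nn.Conv2d(1, 8, 3, padding=1)
discriminator = torch.nn.Conv2d(1, 8, 3, padding=1)

# Optimizers as stated in the paper: Adam for the network (lr 1e-2),
# SGD for the discriminator (lr 1e-4).
g_opt = torch.optim.Adam(generator.parameters(), lr=1e-2)
d_opt = torch.optim.SGD(discriminator.parameters(), lr=1e-4)

BATCH_SIZE = 128      # batch size from the paper
NUM_EPOCHS = 100      # number of training epochs from the paper
INPUT_HW = (64, 200)  # grayscale input: width 200, height 64 (NCHW: H x W)
```

Note the asymmetric optimizer choice: a faster Adam schedule for the generator against a slow SGD discriminator is a common way to keep GAN training balanced.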

Conclusion

In this paper, we propose an unattached scene text rectification method with a refined objective. The proposed network can work independently to rectify text images for recognition or other usages, and it can be trained without annotations of text content, which reduces the annotation workload. The proposed sample loss is able to control the pixels in the output images that are supposed to be sampled from the input one, and it performs well in rectification tasks. Experiments on public

CRediT authorship contribution statement

Yanxiang Gong: Conceptualization, Methodology, Software, Writing - original draft. Linjie Deng: Methodology, Investigation. Zhiqiang Zhang: Visualization, Validation. Guozhen Duan: Validation. Zheng Ma: Writing - review & editing. Mei Xie: Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This work was supported by the National Key Research and Development Program of China [ID: 2018AAA0103203].


References (50)

  • I.J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative...
  • S. Xie et al., Aggregated residual transformations for deep neural networks, in
  • K. He et al., Deep residual learning for image recognition
  • O. Ronneberger et al., U-net: Convolutional networks for biomedical image segmentation
  • F. Borisyuk et al., Rosetta: Large scale system for text detection and recognition in images
  • P. Keserwani et al., Quadbox: Quadrilateral bounding box based scene text detection using vector regression, IEEE Access (2021)
  • P.P. Roy et al., Convex hull based approach for multi-oriented character recognition from graphical documents
  • A.K. Bhunia et al., Text recognition in scene image and video frame using color channel selection, Multimedia Tools and Applications (2018)
  • L. Deng et al., Focus-enhanced scene text recognition with deformable convolutions
  • J. Dai et al., Deformable convolutional networks
  • V. Mnih, N. Heess, A. Graves, K. Kavukcuoglu, Recurrent models of visual attention, in: Proceedings of the 27th...
  • C.-Y. Lee et al., Recursive recurrent nets with attention modeling for OCR in the wild
  • Z. Cheng et al., Focusing attention: Towards accurate text recognition in natural images
  • F. Zhan et al., GA-DAN: Geometry-aware domain adaptation network for scene text detection and recognition, IEEE/CVF International Conference on Computer Vision (2019)
  • B. Shi et al., ASTER: An attentional scene text recognizer with flexible rectification, IEEE Transactions on Pattern Analysis and Machine Intelligence (2018)

    Yanxiang Gong received the B.E. degree in electronic and information engineering from University of Electronic Science and Technology of China (UESTC), Chengdu, China, in 2017. He is currently pursuing the Ph.D degree in information and communication engineering with UESTC, Chengdu, China. His research interests include computer vision and machine learning.

    Linjie Deng received the M.S. degree from Sichuan University in 2015. He is currently a Ph.D. candidate with the School of Information and Communication Engineering at University of Electronic Science and Technology of China (UESTC). His research interests are related to the application of computer vision and machine learning.

    Zhiqiang Zhang received the B.E. degree from University of Electronic Science and Technology of China (UESTC), Chengdu, China, in 2019. He is currently pursuing the M.S. degree in information and communication engineering from UESTC, Chengdu, China. He is a student member of the IEEE, majoring in imaging processing and computer vision.

    Guozhen Duan received the B.E. degree from Civil Aviation University of China, Tianjin, China, in 2020. He is currently pursuing the M.S. degree in electronic information from University of Electronic Science and Technology of China, Chengdu, China. His research interests are related to the application of computer vision and machine learning.

    Zheng Ma is a professor with the School of Information and Communication Engineering at University of Electronic Science and Technology of China (UESTC). His research interests include convex optimization, computer vision and image processing.

    Mei Xie is a professor with School of Information and Communication Engineering at University of Electronic Science and Technology of China (UESTC). She received the Ph.D. degree in signal and information processing (SIP) from UESTC in 1997. Between 1997 and 1999, she studied at University of Hong Kong and University of Texas for the postdoctoral degree, respectively. Her research interests include pattern recognition, computer vision, and artificial intelligence.
