Neurocomputing, Volume 333, 14 March 2019, Pages 284-291

Elite Loss for scene text detection

https://doi.org/10.1016/j.neucom.2018.12.009

Abstract

Many scene text detection approaches generate foreground segmentation maps to detect text instances. In these methods, all the pixels within the bounding box regions of the text are usually treated equally as foreground during training. However, different from the general object segmentation problem, we argue that not all the pixels across the text bounding box region contribute equally to locating the text instance. Specifically, some in-box not-on-stroke pixels even degrade the detection performance. Moreover, for segmentation based methods with a regression step that predicts the corresponding bounding box at each pixel, not all the pixels need to be fully trained to predict foreground text. Therefore, in this paper, we propose Elite Loss, which is intended to down-weight the contributions of the in-box not-on-stroke pixels while paying more attention to the on-stroke pixels. Furthermore, we design a segmentation-based method to validate the effectiveness of the proposed Elite Loss. Extensive experiments demonstrate that our methods achieve state-of-the-art results on all three challenging datasets, with F-scores of 0.855 on ICDAR2015, 0.425 on COCO-Text, and 0.819 on MSRA-TD500.

Introduction

Scene text detection has attracted increasing attention for its important role in many computer vision tasks. It aims at localizing text with bounding boxes of words or text lines. Various methods have been proposed to tackle this problem [1], [2], [3], [4], [5]. The main challenges of scene text detection are the large variety of text in scale, layout, font, and orientation, as well as cluttered backgrounds that are easily confused with text. Traditional text detection approaches are mostly bottom-up [6], [7], [8]: many hand-crafted features have been designed to distinguish text from background regions, but these methods perform poorly on complex scenes.

Recently, many deep learning based methods have been proposed to detect text. Some evolve from general object detection frameworks such as SSD [9] or Faster RCNN [10] and utilize reference boxes to detect text, e.g., TextBoxes [11], TextBoxes++ [12], SegLink [1], RRPN [5], R2CNN [13], RRD [14], and FSTN [15]. These methods show some improvement over traditional approaches, but they cannot deal with multi-oriented text well, as discussed in [4].

Another category of deep learning based methods is inspired by FCN [16], which is widely used in various segmentation tasks. Zhang et al. [17] and He et al. [18] used FCN to locate raw text regions, but they need complicated post-processing steps. Lyu et al. [19] use corners and position-sensitive segmentation maps to segment each text instance. Liu et al. [20] proposed MCN, in which a Markov Clustering Network groups the segmented foreground pixels into text instances. Recently, methods such as EAST [2] and Direct Regression [4], [21] have been proposed; they regress a bounding box at each pixel's location in an end-to-end way. We call these methods segmentation based methods because they detect text in a segmentation-like style.

The segmentation based methods detect text instances by pixel-wise predictions. However, due to the lack of fine-grained text segmentation annotations, they usually treat all the pixels inside a text's bounding box as foreground. This is different from the general object segmentation task, where the sub-regions of the target segmentation map have contours consistent with the original objects. We consider the use of the raw ground truth in the segmentation step a trade-off caused by the lack of ground truth segmentation maps of text strokes. However, it introduces optimization errors in the backward propagation of the network. Fig. 1(c) shows an output score map of a segmentation based detector [2] after the first iteration when it is fine-tuned from an ImageNet-pretrained model. The detector has high activations on the stroke areas and relatively low activations on the smooth areas outside the strokes. Obviously, the pixels in the in-box outside-stroke areas cannot easily be distinguished from the outside-box background pixels because of their similar appearance. This means that pixels on the strokes capture the most distinctive characteristics of the text and are thus easy to learn, compared to the pixels in the smooth outside-stroke regions. For example, from Fig. 1(a–c), it can be seen that pixel A has a stronger response at its corresponding location on the score map than pixel B after the first iteration.

To further inspect the above phenomenon, this paper takes as an example the category of segmentation based methods that have an additional regression step [2], [4]. These methods perform classification and bounding box regression on each pixel of the output segmentation maps to detect text instances. Each pixel outputs one text instance bounding box if it is classified as positive. We can conclude that the pixels of the score map are the basic elements of the detection task and act independently of each other; in this paper, we call these basic elements predicting units to better describe their roles. These methods treat all the predicting units of the same corresponding text instance equally. In this case, the total training loss of all the predicting units of the text instance may easily be dominated by the in-box not-on-stroke predicting units, because they usually have high loss values, while the essential on-stroke predicting units are less considered, because they usually have small loss values. Meanwhile, since many of the not-on-stroke predicting units tend to have appearances similar to predicting units on the background (e.g., the unit at the location of pixel B in Fig. 1), forcing the network to classify the hard not-on-stroke predicting units as foreground may lead to many false detections in background regions.
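To make the notion of a predicting unit concrete, the following sketch (our own PyTorch simplification, not the authors' released code; the tensor layouts and the exact loss terms are assumptions in the EAST-like style described above) computes a per-pixel classification loss and a per-pixel box-regression loss, where every spatial location of the score map acts as an independent predicting unit:

import torch
import torch.nn.functional as F

def per_pixel_detection_loss(score_pred, score_gt, geo_pred, geo_gt):
    """Per-pixel loss for a segmentation based detector with a regression step.

    score_pred: (N, 1, H, W) predicted text/non-text probabilities in [0, 1].
    score_gt:   (N, 1, H, W) binary map; every pixel inside a (shrunk) text box is 1.
    geo_pred:   (N, 4, H, W) predicted distances to the box's four sides at each pixel.
    geo_gt:     (N, 4, H, W) ground-truth distances; only valid where score_gt == 1.
    """
    # Every spatial location is an independent predicting unit:
    # it is classified as text/non-text and, if positive, regresses its own box.
    cls_loss = F.binary_cross_entropy(score_pred, score_gt, reduction='none')  # (N, 1, H, W)

    # Box regression is only supervised on foreground predicting units.
    reg_loss = F.smooth_l1_loss(geo_pred, geo_gt, reduction='none').mean(dim=1, keepdim=True)
    reg_loss = reg_loss * score_gt

    # Baseline behaviour: all foreground units are weighted equally, so the many
    # in-box not-on-stroke units can dominate the classification term.
    return cls_loss.mean() + reg_loss.sum() / score_gt.sum().clamp(min=1.0)

Under this baseline formulation, every foreground unit contributes equally to the classification term, which is exactly the behaviour the Elite Loss modifies.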

From another perspective, the predicting units of the same corresponding text instance have duplicated tasks: each of them is able to find the same text instance independently. Thus it is unnecessary to spend much training effort on those not-on-stroke but hard predicting units. To learn a robust text detector, training should focus on the predicting units that represent the text instance better.

It is worth noting that the not-on-stroke but hard predicting units are not the same as the hard samples chosen from all training samples in the hard example mining [22] or bootstrapping [23] procedures, which should receive more attention to benefit the performance. The hard predicting units here are positive samples that are relatively hard but unnecessary to learn. These not-on-stroke but hard predicting units can also be regarded as units with noisy labels in a sense, because their appearance is similar to the background and it is unreasonable to consider them typical positive samples during training.

As analyzed above, we consider that the on-stroke predicting units capture the intrinsic characteristics of text regions, so we call them elite predicting units. In this paper, we propose a new loss re-weighting strategy to train the detector. We re-weight the classification losses of the predicting units to automatically lower the contributions of the not-on-stroke but hard predicting units during training. This helps the detector focus on learning better features to correctly classify the elite on-stroke predicting units on the foreground. We call this reweighted loss Elite Loss, because it focuses more on the elite predicting units (i.e., the on-stroke predicting units) and pays less attention to the noisy ones, i.e., the not-on-stroke predicting units. The Elite Loss is flexible in its specific forms and effective in practice: it improves our self-built baseline detector significantly and reaches a new state-of-the-art on various benchmarks. In Fig. 1, we use the method of [24] to show the difference in the effective receptive fields of the same score-map pixel at the location of point A with and without the Elite Loss. We can conclude that the Elite Loss makes the effective receptive field more concentrated, which is beneficial to the pixel-wise classification task since the surrounding noise is largely suppressed.
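As a concrete illustration of this re-weighting idea (the paper's two specific forms are presented in Section 3; the weighting function below is only an assumed, illustrative choice), the per-unit cross-entropy of each foreground predicting unit can be scaled by the unit's own predicted foreground probability, so that well-activated on-stroke units keep their weight while hard, background-like in-box units are suppressed:

import torch.nn.functional as F

def elite_weighted_cls_loss(score_pred, score_gt, gamma=2.0, eps=1e-6):
    """Illustrative Elite-style re-weighting of the per-pixel classification loss.

    Foreground units that the network already activates on (typically on-stroke
    pixels, cf. Fig. 1(c)) receive a weight close to 1; foreground units with low
    activation (in-box not-on-stroke pixels) are down-weighted by p^gamma.
    Background units are left unchanged. `gamma` is a hypothetical hyper-parameter.
    """
    ce = F.binary_cross_entropy(score_pred, score_gt, reduction='none')

    fg = score_gt                          # 1 on (shrunk) text boxes, 0 elsewhere
    p = score_pred.clamp(eps, 1.0 - eps)
    fg_weight = p.detach() ** gamma        # stop-gradient so the weight acts as a constant

    weight = fg * fg_weight + (1.0 - fg)   # down-weight only foreground units
    return (weight * ce).sum() / weight.sum().clamp(min=1.0)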

The contributions of this paper are listed as follows:

  • We propose the Elite Loss for segmentation based text detection networks that have a regression step, which down-weights the contributions of the in-box not-on-stroke pixels to improve training.

  • To demonstrate the effectiveness and the flexibility of Elite Loss, we design two specific forms of Elite Loss and evaluate them in the task of text detection.

  • With the Elite Loss integrated into a segmentation based text detector that has a regression step, we achieve state-of-the-art results on various datasets.

Section snippets

Task-specific loss functions

For general object detection, Focal Loss [25] is proposed to handle the class-imbalance problem by focusing on hard examples and down-weighting the easy ones. Differently, Elite Loss handles the imprecise labels of a text instance's predicting units. It down-weights the not-on-stroke examples that are unnecessary or may even degrade the detector's performance. For the robust estimation task, Huber Loss [26] also down-weights the hard-to-learn samples, which are regarded as outliers for …
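For reference, the standard Focal Loss for a unit whose predicted probability of its true class is p_t is

FL(p_t) = -(1 - p_t)^γ · log(p_t),

which up-weights hard examples through the (1 - p_t)^γ factor. An Elite-style weighting moves in the opposite direction for foreground units, e.g.

EL(p) = -p^γ · log(p) for in-box units,

so that hard, background-like in-box units contribute little to the total loss (the second expression is an illustrative form consistent with the sketch above, not the paper's exact definition).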

Elite Loss for scene text detection

The Elite Loss is proposed for segmentation based detectors, in which each pixel is a predicting unit. It is intended to focus training on the on-stroke predicting units and discard the in-box not-on-stroke predicting units, which are unnecessary and may harm the detector. We first present the definition of Elite Loss, which evolves from the existing classification loss function for text detection, in Section 3.1. Then Section 3.2 gives a discussion on the specific rules on …

Elite Loss on segmentation based text detector

We choose EAST [2] as the representative segmentation based text detector to demonstrate the effectiveness of Elite Loss. EAST adopts an FCN to generate a pixel-level text score map representing the presence of text and geometry maps encoding the words' bounding boxes. In EAST, the pixels in the shrunken bounding boxes are regarded as foreground. As shown in Fig. 3, three modifications are conducted to enhance the original implementation, and we call the new baseline EAST+. Firstly, …
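EAST's ground truth shrinks each text quadrilateral along its edges before marking pixels as foreground. A minimal sketch of this target-generation step, simplified here to axis-aligned boxes with a hypothetical shrink ratio (EAST itself shrinks general quadrilaterals edge-wise), could look like:

import numpy as np

def shrunk_box_score_map(boxes, h, w, shrink=0.3):
    """Build a binary score-map target from axis-aligned boxes.

    boxes: list of (x1, y1, x2, y2). Pixels inside each box shrunk by `shrink`
    on every side are labelled foreground; this is a simplified stand-in for
    EAST's edge-wise quadrilateral shrinking.
    """
    score = np.zeros((h, w), dtype=np.float32)
    for x1, y1, x2, y2 in boxes:
        dx = shrink * (x2 - x1)
        dy = shrink * (y2 - y1)
        xs1, ys1 = int(round(x1 + dx)), int(round(y1 + dy))
        xs2, ys2 = int(round(x2 - dx)), int(round(y2 - dy))
        if xs2 > xs1 and ys2 > ys1:
            score[ys1:ys2, xs1:xs2] = 1.0
    return score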

Experiment

In this section, we conduct extensive experiments on three public benchmark datasets to verify the effectiveness of our Elite Loss.

Conclusion

In this paper, we propose the Elite Loss to address the problem that the current segmentation labels are unsuitable for the network to learn in segmentation based text detection methods. We found that for segmentation based methods that have a regression step, each pixel location on the output feature map is an independent predicting unit. Instead of treating all the predicting units equally, the Elite Loss reweights them according to their contributions to the detector's performance. It forces …

Acknowledgments

This work was supported by the Natural Science Foundation of China under Grants 61772527, 61806200, and 61876086.


References (43)

  • B. Shi et al., Detecting oriented text in natural images by linking segments, Proceedings of the CVPR, IEEE (2017).
  • X. Zhou et al., EAST: an efficient and accurate scene text detector, Proceedings of the CVPR, IEEE (2017).
  • C. Yao, X. Bai, N. Sang, X. Zhou, S. Zhou, Z. Cao, Scene text detection via holistic, multi-channel prediction, CoRR...
  • W. He et al., Deep direct regression for multi-oriented scene text detection, Proceedings of the ICCV, IEEE (2017).
  • J. Ma et al., Arbitrary-oriented scene text detection via rotation proposals, IEEE TMM (2018).
  • B. Epshtein et al., Detecting text in natural scenes with stroke width transform, Proceedings of the ICCV, IEEE (2010).
  • L. Neumann et al., Real-time scene text localization and recognition, Proceedings of the CVPR, IEEE (2012).
  • M. Jaderberg et al., Deep features for text spotting, Proceedings of the ECCV, IEEE (2014).
  • W. Liu et al., SSD: single shot multibox detector, Proceedings of the ECCV (2016).
  • S. Ren et al., Faster R-CNN: towards real-time object detection with region proposal networks, Proceedings of the NIPS (2015).
  • M. Liao et al., TextBoxes: a fast text detector with a single deep neural network, Proceedings of the AAAI (2017).
  • M. Liao et al., TextBoxes++: a single-shot oriented scene text detector, IEEE TIP (2018).
  • Y. Jiang, X. Zhu, X. Wang, S. Yang, W. Li, H. Wang, P. Fu, Z. Luo, R2CNN: rotational region CNN for orientation robust...
  • M. Liao et al., Rotation-sensitive regression for oriented scene text detection, Proceedings of the CVPR, IEEE (2018).
  • Y. Dai et al., Fused text segmentation networks for multi-oriented scene text detection.
  • E. Shelhamer et al., Fully convolutional networks for semantic segmentation, IEEE TPAMI (2017).
  • Z. Zhang et al., Multi-oriented text detection with fully convolutional networks, Proceedings of the CVPR, IEEE (2016).
  • T. He et al., Text-attentional convolutional neural network for scene text detection, IEEE TIP (2016).
  • P. Lyu, C. Yao, W. Wu, S. Yan, X. Bai, Multi-oriented scene text detection via corner localization and region...
  • Z. Liu et al., Learning markov clustering networks for scene text detection.
  • W. He et al., Multi-oriented and multi-lingual scene text detection with direct regression, IEEE TIP (2018).

    Xu Zhao received the B.E. degree in 2014 from Dalian University of Technology, China. He has been pursuing a Ph.D. degree in pattern recognition and intelligence systems at the National Laboratory of Pattern Recognition, Chinese Academy of Sciences, since 2014. His research interests include object detection, scene text detection, image and video processing, and intelligent video surveillance.

    Chaoyang Zhao received the B.E. degree and the M.S. degree in 2009 and 2012 respectively from University of Electronic Science and Technology of China. He received the Ph.D. degree in pattern recognition and intelligence systems from the National Laboratory of Pattern Recognition, Chinese Academy of Sciences, in 2016. He is currently an Assistant Professor in National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences. His research interests include object detection, image and video processing and intelligent video surveillance.

    Haiyun Guo received her B.E. degree from Wuhan University in 2013 and the Ph.D. degree in pattern recognition and intelligence systems from the Institute of Automation, University of Chinese Academy of Sciences, in 2018. She is currently an assistant researcher in the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences. Her current research interests include pattern recognition and machine learning, image and video processing, and intelligent video surveillance.

    Yousong Zhu received the B.E. degree from Central South University, Changsha, China, in 2014. He is currently pursuing the Ph.D. degree in pattern recognition and intelligence systems with the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences. His current research interests include object detection, video object detection, pattern recognition and machine learning, and intelligent video surveillance.

    Ming Tang received the B.S. degree in computer science and engineering and M.S. degree in artificial intelligence from Zhejiang University, Hangzhou, China, in 1984 and 1987, respectively, and the Ph.D. degree in pattern recognition and intelligent system from the Chinese Academy of Sciences, Beijing, China, in 2002. He is currently a Professor with the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences. His current research interests include computer vision and machine learning.

    Jinqiao Wang received the B.E. degree in 2001 from Hebei University of Technology, China, and the M.S. degree in 2004 from Tianjin University, China. He received the Ph.D. degree in pattern recognition and intelligence systems from the National Laboratory of Pattern Recognition, Chinese Academy of Sciences, in 2008. He is currently a Professor with Chinese Academy of Sciences. His research interests include pattern recognition and machine learning, image and video processing, mobile multimedia, and intelligent video surveillance.
