1 Introduction

Text understanding in natural images is an important prerequisite for many artificial intelligence (AI) and computer vision (CV) applications, such as autonomous robots, systems for the visually impaired, context retrieval, and multi-language machine translation based on image inputs [1,2,3,4,5,6,7,8,9]. Researchers have demonstrated that once text is well detected, existing text recognition methods can achieve high accuracy [4, 6]. Text spotting is therefore the current bottleneck and remains a challenging CV task, because backgrounds in natural scenes such as street view images are highly cluttered and the text in them varies greatly in style (e.g., artistic fonts, Times New Roman and Sim Sun in different colors), language (e.g., Chinese, Japanese and English), size (e.g., text on a cafe signboard versus text on its food menu board), illumination (e.g., offices, restaurants, bars, sunny countryside, and cloudy streets), and contrast (e.g., over-exposed and under-exposed). Other factors, including low resolution, out-of-focus blur, occlusion, and the similarity between objects and characters (e.g., a tyre and the character ‘o’), impose additional difficulties on text spotting [10]. Figure 1 illustrates some of these challenges. Thus, researchers are still actively seeking more robust and accurate text spotting methods.

Fig. 1.

Challenges in text spotting: the yellow box marks text missed and/or wrongly detected by one of the state-of-the-art methods [11]. (a)-(c) show errors caused by a road barrier, a signboard and text reflection on glass, respectively. Note that the road barrier resembles text such as ‘lllllllll’ and ‘nnnnnn’.

Currently, researchers concentrate on designing more effective deep network architectures and training schemes to extract more useful information from text, including character-level, word-level and text line-level cues as well as precise text locations, down to one-pixel accuracy [12, 13]. For particular applications, such as shopping assistants for grocery and book stores [14, 15], more prior knowledge can be exploited to achieve higher detection accuracy. More precisely, in these environments text appears in particular locations with similar styles and colors, and the backgrounds are more predictable. However, this prior knowledge is not generally applicable to natural scene images, which likely have cluttered backgrounds, because there is no control over where and how the images are taken.

Even though images are not taken in a controlled environment, we still have a rough idea about them, because they are taken where we stay, live, work, and travel, such as cities, streets, offices, cafes, and parks. Text is likely to appear on particular man-made objects, e.g., books, computers, and signboards, but unlikely on natural objects, e.g., water, sky, trees, and grass. Some objects co-occur with text more often than others. For example, text always appears on a car plate but not always on the side of a car. In other words, objects and text are not independent: the appearance of text typically depends on the type of objects in the scene. Figure 2 illustrates a few dependences between objects and text in street view images. Furthermore, this information can reduce detection errors caused by the similarity between objects and text, e.g., a tyre and ‘O’; once a car is detected, text is unlikely to appear at its bottom. To the best of the authors’ knowledge, none of the previous studies has exploited this information for detecting text in natural scene images. The aim of this paper is to develop an algorithm that exploits this information to enhance text spotting performance. In this study, the authors are particularly interested in images with cluttered backgrounds, such as images taken on streets, because they are challenging even for the state-of-the-art methods and likely contain objects, the target of this study.

Fig. 2.

Dependence between objects and text in street view images. For example, (a and c) signboard and digits, (a-b) car and car plate, (b) building and text, and (d) cloth and text.

Text spotting can be considered a specific case of object detection. In recent years, advancements in object detection have been driven by region proposal (RP) methods [11, 16, 17]. Fast RCNN [18] and its latest developments [11] are among these methods. Faster RCNN, which shares convolutional layers between the region proposal network (RPN) and fast RCNN, is one of the best state-of-the-art methods in object detection with low computational cost [11]. Because of its accuracy and speed, it is selected as the baseline network for this study.

Converting faster RCNN to detect text in images with cluttered backgrounds can be done by training the network on images with their text labels only. However, this approach does not consider object information, which is the focus of this study. If the network is trained on images with object and text labels together, as in the original faster RCNN training procedure, the objects would possibly degrade its performance on text because the network would balance its performance between text and the other objects. Another approach is to encode the object-text relationship in a knowledge graph, where each node represents a specific type of object or text and each edge describes how likely two objects, or text and an object, appear together. This approach can use faster RCNN to first detect objects and then use the adjacency matrix of the knowledge graph to refine the results from faster RCNN [11]. It can in fact be considered decision-level fusion, because the final results from faster RCNN, which are the bounding boxes of the objects and text, are fused with the knowledge graph information. This approach neither makes use of the object features in faster RCNN nor optimizes the network end-to-end. These potential approaches are therefore likely sub-optimal. In this paper, an algorithm is proposed to exploit object features and text features directly inside a deep network and to train it end-to-end for better performance.
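For illustration only, a minimal sketch of the decision-level fusion alternative discussed above is given below; the detection format, the class names and the simple re-scoring rule are assumptions made for illustration and are not part of the proposed TO-CNN.

```python
# Hypothetical sketch of decision-level fusion with a knowledge graph (not the proposed method).
# Each detection is (class_name, score, bbox); `adjacency` holds co-occurrence likelihoods,
# e.g. adjacency[("text", "car_plate")] = 0.9 for a strong positive dependence.
def refine_text_scores(detections, adjacency, weight=0.5):
    objects = [d for d in detections if d[0] != "text"]
    refined = []
    for cls, score, bbox in detections:
        if cls == "text" and objects:
            # Boost or suppress a text score using the strongest co-occurrence cue in the image.
            support = max(adjacency.get(("text", obj_cls), 0.0) for obj_cls, _, _ in objects)
            score = (1 - weight) * score + weight * support
        refined.append((cls, score, bbox))
    return refined
```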

For this study, a new dataset named the Nanyang Technological University Unconstrained Text and Object Image Dataset (NTU-UTOI) is established. It contains 22,767 natural scene images with 165,749 bounding boxes for 42 classes of objects and 111,868 bounding boxes for text, including English, Chinese and digits. Figure 2 shows samples from the NTU-UTOI dataset; more information about the dataset can be found in Sect. 4. To the best of our knowledge, it is the second largest real (non-synthetic) natural scene image dataset for text spotting. NTU-UTOI is used for training and testing the proposed algorithm. In addition, three benchmarks from three different groups are employed in the evaluations and comparisons: SVT, MSRA-TD500, and COCO-Text. These three databases are challenging because their images are taken from diverse environments and have cluttered backgrounds.

The rest of the paper is organized as follows: Sect. 2 gives a brief summary of state-of-the-art text detection methods. Section 3 elaborates the proposed algorithm. Section 4 reports comparison results with the state-of-the-art text detection methods on the three benchmark datasets and the NTU-UTOI dataset. Section 5 gives some concluding remarks.

2 Related Works

Text detection in natural scene images has been studied for several decades [2, 12, 19, 20] and various methods have been proposed, which can be broadly categorized into character-region methods and sliding-window methods. The character-region methods aim to segment pixels into characters and then group the characters into words [12, 19,20,21,22,23,24], while the sliding-window methods determine whether the pixels in a sliding window belong to text or not [9, 25,26,27]. Text detection methods can also be categorized into image processing-based methods and deep learning-based methods. The image processing-based methods pre-process images, then extract features and finally classify pixels into text and background. The deep learning-based methods exploit the capability of deep networks to automatically extract features and perform detection on their feature maps. Generally speaking, deep learning methods perform better but demand more computational resources, particularly during training.

Epshtein et al. proposed a per-pixel output transformation called the stroke width transform (SWT) for text detection [12]. Neumann and Matas [24] proposed a method based on gradient filters to detect oriented strokes, which significantly outperforms SWT. Anthimopoulos et al. proposed a sliding-window method, which uses dynamically normalized edges as features and a random forest classifier to detect text in natural scene images [27]. Chen et al. used edge-enhanced maximally stable extremal regions (MSERs) for text detection [19]; their method outperforms SWT because it is more robust to blurred images and more effective at filtering out false-positive characters. Posner et al. proposed a cascade of boosted classifiers with a saliency map to create bounding boxes for text detection [28]. In 2012, Wang et al. claimed to be the first group to use a convolutional neural network (CNN) for text spotting [29]; they trained the CNN on a synthetic dataset [8].

In recent years, researchers have considered words and text lines as whole generic objects, ignoring the character components, so that generic object detectors can be modified for text detection [13]. In 2017, Rong et al. proposed a recurrent dense text localization network (DTLN) using long short-term memory (LSTM) for unambiguous text localization and retrieval [15]. Zhong et al. modified faster RCNN for text detection [10]. Furthermore, Liao et al. proposed TextBoxes, inspired by the Single Shot multibox Detector (SSD) [30], to achieve higher detection accuracy and speed [31].

In fact, text can be considered a generic object, as discussed earlier. Using deep learning and region proposal networks (RPNs) for generic object detection has attracted great attention from many researchers. The state-of-the-art object detection methods based on RPNs have achieved very significant improvements [18, 32] compared with traditional methods. In addition to faster RCNN, there are other region proposal methods, such as selective search (SS) [33], multiscale combinatorial grouping (MCG) [34], and edge-boxes (EB) [35]. These methods generate an exceedingly large number of region proposals, resulting in high recall but also high computational demand. To overcome this problem, the RPN computes region proposals by sharing convolutional layers with fast RCNN, which substantially reduces the computational cost while achieving a promising recall rate. Inspired by [11], in this paper the RPN is first trained on images with object labels and then combined with another deep network and trained on images with text labels. Researchers have proposed deep learning models trained on large datasets such as COCO-Text and SynthText [36, 37], but none of them exploited the object information near text.

3 Methodology

This section first describes the proposed deep network architecture and training stages. Then, the anchor parameters, which are designed for text spotting, are given. The loss function for training the network and the implementation details are provided at the end of this section.

Fig. 3.

The proposed TO-CNN for text spotting based on object information. (a) The first training stage, which extracts object information and stores it in the Object CNN. (b) The second training stage, which tunes the parameters of the Text CNN, and the third training stage, which fine-tunes the entire network for text spotting.

3.1 Network Architecture and Training Stages

To use object features in deep networks to enhance text spotting performance, a convolutional neural network (CNN) with two sub-networks and three training stages is proposed. The proposed deep network is named the Text and Object-based CNN (TO-CNN). Figure 3 illustrates the proposed network and its training stages. In this study, faster RCNN with the VGG-16 net [38] as a backbone is used to extract object and text information. In the first training stage, faster RCNN is trained on images with text and object labels, as illustrated in Fig. 3(a). Once the network is fully trained, the object and text information is stored in the VGG-16 net. For convenience, this trained VGG-16 net is called the Object VGG-16 net; note that it also stores text information. The Object VGG-16 net is then separated from the other components of faster RCNN, and a CNN modified from another VGG-16 network is added on top of it. This CNN is called the Text VGG-16 net; the details of the modification are given later. The Object VGG-16 net and the Text VGG-16 net together form the backbone of TO-CNN. TO-CNN also consists of the RPN and the regression networks from faster RCNN, as illustrated in Fig. 3(b). In the second training stage, TO-CNN is trained on images with text labels only while all parameters in the Object VGG-16 net are fixed. In this stage, the Text VGG-16 net takes the object and text features from the Object VGG-16 net and tunes its parameters for text detection; in other words, the Text VGG-16 net fuses the text and object features for text detection. In the third training stage, the entire TO-CNN, including the Text VGG-16 net and the Object VGG-16 net, is fine-tuned. At the end of this stage, the network is fully optimized for text spotting based on object and text information.
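As a rough illustration of the three training stages, a minimal sketch is given below; the authors use Caffe, so the PyTorch-style module names and the helper function are assumptions, not the authors' code.

```python
# Sketch of the three TO-CNN training stages (assumed PyTorch-style, not the authors' Caffe code).
import torch.nn as nn
from torchvision.models import vgg16

object_vgg = vgg16().features  # "Object VGG-16": backbone of faster RCNN trained in stage 1
text_vgg = vgg16().features    # "Text VGG-16": fuses object and text features from stage 2 onwards

def set_trainable(module: nn.Module, trainable: bool) -> None:
    for p in module.parameters():
        p.requires_grad = trainable

# Stage 1: train faster RCNN (with object_vgg as its backbone) on object and text labels.
# Stage 2: freeze the Object VGG-16; train the Text VGG-16 (and RPN/heads) on text labels only.
set_trainable(object_vgg, False)
set_trainable(text_vgg, True)
# Stage 3: unfreeze everything and fine-tune the entire TO-CNN end-to-end.
set_trainable(object_vgg, True)
```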

The Text VGG-16 net is modified to take input feature maps from the Object VGG-16 net. There are different approaches to merging two networks [39,40,41]; the stacked hourglass approach [40] is one of the effective ones. In this paper, following a similar hourglass approach, the output of the Object VGG-16 net is up-sampled and combined with the Text VGG-16 net by adding three up-sampling layers and one normalization layer before the RPN learning process.
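A minimal sketch of this merge is given below; the channel count, the bilinear up-sampling and the use of batch normalization as the single normalization layer are assumptions made for illustration and may differ from the authors' implementation.

```python
# Assumed sketch of merging Object VGG-16 features into the Text VGG-16 stream with
# three up-sampling layers and one normalization layer (layer choices are illustrative).
import torch
import torch.nn as nn

class ObjectTextFusion(nn.Module):
    def __init__(self, channels: int = 512):
        super().__init__()
        self.upsample = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
        )
        self.norm = nn.BatchNorm2d(channels)

    def forward(self, object_feat: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
        # Up-sample the Object VGG-16 output, normalize it, and combine it with the text stream;
        # the combined feature map is then passed to the RPN.
        return self.norm(self.upsample(object_feat)) + text_feat
```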

In order to detect objects of different sizes, faster RCNN uses two hyper-parameters, scale and aspect ratio, to control the region proposals. Ren et al. used three scales, 8, 16 and 32, to determine the size of the sliding anchors, with three aspect ratios: 1:1, 1:2 and 2:1 [11]. In TO-CNN, the scale is also fixed to three levels, but the aspect ratios are modified because the original ratios were designed for generic object detection. Text usually has different aspect ratios from generic objects, so the new aspect ratios are set to 1:1, 1:2, 2:1, 1:5 and 5:1 to cover almost all text lines and words in images. The summary of the anchors used in the proposed network is given in Fig. 3(b), top-left. Note that at each point on the final feature map there are 15 anchors \((5\times 3)\) per sliding position, so for a convolutional map of \(W \times H\), there are \(W\times H\times 15\) anchors.
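As a concrete illustration, the anchor enumeration can be written as in the sketch below; the base anchor size of 16 pixels is borrowed from the original RPN and is an assumption rather than a value stated above.

```python
# Sketch of anchor enumeration with the modified aspect ratios (base size of 16 px is assumed).
SCALES = (8, 16, 32)
RATIOS = ((1, 1), (1, 2), (2, 1), (1, 5), (5, 1))  # height : width

def anchors_at(cx, cy, base=16):
    """Return the 15 anchors (5 ratios x 3 scales) centred at one sliding position."""
    boxes = []
    for scale in SCALES:
        for rh, rw in RATIOS:
            h = base * scale * (rh / rw) ** 0.5  # preserve the area base^2 * scale^2
            w = base * scale * (rw / rh) ** 0.5
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes

# For a W x H feature map, the total number of anchors is therefore W * H * 15.
```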

TO-CNN uses the same translation-invariant property of the RPN [11], which results in 2,397,696 parameters in the proposal layer. In other words, if text is translated in an image, the proposal is translated accordingly, and the same function predicts the proposal regardless of its location.

3.2 Loss Functions

In the first training stage, the original loss function in faster RCNN is employed to extract object information. In the second and third training stages, the multi-task loss function \(\mathfrak {L}\) given below is used [42]

$$\begin{aligned} \mathfrak {L}(p_l,v,v^*)=\mathfrak {L}_{cls} (p_l)+ \alpha \mathfrak {L}_{reg} (v,v^*) \end{aligned}$$
(1)

where \(l=1\) and \(l=0\) represent text and background, respectively, \(p_l\) is the corresponding probability computed using softmax, \(\mathfrak {L}_{cls}\) is the classification loss, \(\mathfrak {L}_{reg}\) is the regression loss between the predicted and ground-truth bounding boxes, \(\alpha \) is a weight balancing the two losses, and v and \(v^*\) are the predicted and ground-truth bounding boxes, respectively. The bounding boxes are represented by their top-left corner coordinates, width and height, i.e., \(\{v_x,v_y,v_w ,v_h\}\) for v and \(\{v_x^*,v_y^* ,v_w^* ,v_h^*\}\) for \(v^*\). The classification and regression losses are defined in Eqs. 2 and 3, respectively,

$$\begin{aligned} \mathfrak {L}_{cls} (p_l) = -\log p_l \end{aligned}$$
(2)
$$\begin{aligned} \mathfrak {L}_{reg} (v,v^*) = \sum _{i\in \{x,y,w,h\}} smooth_{L_1}(v_i - v_i^*) \end{aligned}$$
(3)

where

$$\begin{aligned} smooth_{L_1}(x) = \left\{ \begin{array}{ll} 0.5x^2 &{} \text {if } |x|<1 \\ |x| - 0.5 &{} \text {otherwise} \end{array}\right. \end{aligned}$$
(4)

In this paper, the \(smooth_{L_1}\) loss is used because it is less sensitive to outliers and requires less careful tuning of the learning rate [13]. As in the RPN, the features used for regression here have the same spatial size, namely 3 by 3 on the feature maps, which helps to perform bounding box regression more efficiently [11].
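For reference, a minimal sketch of Eqs. (1)-(4) is given below; it applies the losses directly to the box coordinates (without the coordinate normalization used in faster RCNN) and restricts regression to text anchors, both of which are simplifying assumptions.

```python
import math

def smooth_l1(x: float) -> float:
    # Eq. (4): quadratic near zero, linear for large residuals (less sensitive to outliers).
    return 0.5 * x * x if abs(x) < 1 else abs(x) - 0.5

def multitask_loss(p_l: float, v, v_star, alpha: float = 1.0, is_text: bool = True) -> float:
    """Eq. (1): classification loss plus weighted regression loss.

    v and v_star are (x, y, w, h) tuples for the predicted and ground-truth boxes.
    """
    cls_loss = -math.log(p_l)                                    # Eq. (2)
    reg_loss = sum(smooth_l1(a - b) for a, b in zip(v, v_star))  # Eq. (3)
    return cls_loss + (alpha * reg_loss if is_text else 0.0)
```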

3.3 Training and Implementation Details

The Object CNN and the Text CNN are initialized with the VGG-16 model pre-trained on ImageNet classification [38]. The weights are updated with learning rates of \(10^{-3}\) and \(10^{-4}\) for the first 100,000 and the next 350,000 iterations, respectively; that is, the base learning rate is \(10^{-3}\) and the learning rate decay parameter is \(\gamma = 0.1\). The weight decay and momentum are set to \(\omega = 5\times 10^{-4}\) and \(\mu =0.9\), respectively. These parameters are employed in all three training stages.
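The schedule described above corresponds to a simple step decay; a minimal sketch follows, in which the function and constant names are ours rather than the authors'.

```python
# Step learning-rate schedule implied above: 1e-3 for the first 100k iterations,
# then multiplied by gamma = 0.1 (i.e. 1e-4) for the remaining 350k iterations.
def learning_rate(iteration: int, base_lr: float = 1e-3, gamma: float = 0.1,
                  step: int = 100_000) -> float:
    return base_lr * (gamma if iteration >= step else 1.0)

WEIGHT_DECAY = 5e-4  # omega
MOMENTUM = 0.9       # mu
MAX_ITER = 450_000   # 100k + 350k iterations
```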

All the experiments are conducted on an Intel Xeon E5-2690 workstation with 32 GB RAM, an NVIDIA Quadro M6000 24 GB GPU and Ubuntu 14.04. Caffe is used to implement TO-CNN.

4 Experiments and Results

Three benchmark datasets, SVT, MSRA-TD500 and COCO-Text, are employed to evaluate the performance of the proposed algorithm. These three databases are challenging even for the state-of-the-art methods because their images were collected from diverse indoor and outdoor environments under different lighting conditions and have cluttered backgrounds. The COCO-Text dataset [43] is a subset of the MS COCO dataset [44], which is used for studying object detection. It contains 63k images of complex everyday scenes, of which 10k are used for validation and 10k for testing. Figure 4(a) shows sample images from the COCO-Text dataset. MSRA-TD500 is a multi-lingual dataset that includes both English and Chinese text, along with digits, in high resolution. It consists of 500 natural scene images, of which 300 are training images and 200 are testing images. Figure 4(b) shows sample images from the MSRA-TD500 dataset. The street view text (SVT) dataset consists of images collected from Google Street View and is annotated at the word level. It contains smaller and lower-resolution text from street views. SVT has 100 images for training and 249 images for testing, with 647 annotated words in total (it is not fully annotated). It is challenging as it contains some incomplete and/or occluded text and low image quality. Figure 4(c) shows some sample images from this dataset.

Fig. 4.

Text samples from different datasets: (a) COCO-Text, (b) MSRA-TD500, (c) SVT and (d) NTU-UTOI - proposed dataset.

In addition to these three benchmark datasets, TO-CNN is also evaluated on the NTU-UTOI dataset established by the authors. NTU-UTOI consists of 22,767 images from the ICDAR 2011 robust scene text, ICDAR 2015 incidental scene text, KAIST scene text, MSRA-TD500, NEOCR, SVT, USTB-SV1k [3], and Traffic Sign [45] datasets, together with images collected from the Internet and the authors’ personal collections. 18,173 images are used for training and the remaining 4,594 images are used for testing. It should be emphasized that the training set of NTU-UTOI contains no testing images from COCO-Text, MSRA-TD500 or SVT; thus, TO-CNN can be trained on the training set of NTU-UTOI and evaluated on the testing sets of COCO-Text, MSRA-TD500 and SVT. Sample images from the NTU-UTOI dataset are shown in Fig. 4(d). Text and 42 object classes that associate positively or negatively with text were labeled; they are common street view objects. Table 1 lists all the classes. These labels are selected because they have strong relationships with text and commonly appear in natural scene images. In total, 277,617 bounding boxes for text and text-related objects were manually labeled and cross-verified by two workers per image.

Table 1. The object labels of the NTU-UTOI dataset and the frequency counts.
Fig. 5.

Example detection results of TO-CNN on the MSRA-TD500 benchmark dataset.

The NTU-UTOI dataset is also challenging. The images were collected from various imaging environments and contain patterns similar to text (e.g., windows are similar to “D”, “O” and “0”, railings to “1” and “l”, and tires to “o” and “O”), as well as multi-lingual, multi-oriented and multi-scale text. Moreover, it contains blurred and incidental text and images from indoor, outdoor, street, crowd, road, poster and mobile/TV screen scenes. Some examples are given in Figs. 2 and 4(d).

Precision (P), recall (R) and F-score (F) are used as performance measures to evaluate the proposed algorithm and compare it with the state-of-the-art text spotting methods. MSRA-TD500 and SVT have been extensively used as benchmarks for algorithm evaluation, and COCO-Text is a newly released benchmark. Different research groups evaluate their methods on different datasets and train them on different datasets, so for each benchmark dataset the methods reporting state-of-the-art results are selected for comparison; thus, different methods appear in these comparisons. Their training sets and baseline networks are also listed in the result tables. Note that in this paper a detection is counted as correct if its IoU (intersection over union) with the ground truth is at least 0.5. Tables 2, 3 and 4 list the precision, recall and F-score on MSRA-TD500, SVT and COCO-Text, respectively. Figures 5, 6 and 7 show sample outputs on MSRA-TD500, SVT and COCO-Text, respectively.
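The evaluation protocol can be summarized by the sketch below; greedy one-to-one matching between detections and ground-truth boxes is an assumption about details not spelled out above.

```python
# Sketch of the evaluation: a detection is correct if its IoU with an unmatched
# ground-truth box is at least 0.5 (greedy one-to-one matching is assumed).
def iou(a, b):
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def precision_recall_f(detections, ground_truths, thr=0.5):
    matched, tp = set(), 0
    for det in detections:
        for i, gt in enumerate(ground_truths):
            if i not in matched and iou(det, gt) >= thr:
                matched.add(i)
                tp += 1
                break
    p = tp / len(detections) if detections else 0.0
    r = tp / len(ground_truths) if ground_truths else 0.0
    f = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f
```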

Table 2 shows the comparisons between TO-CNN and the state-of-the-art methods on MSRA-TD500. TO-CNN achieves the best results in terms of precision, recall and F-score. It achieves a precision of 0.87, the same as EAST [37] and Lyu et al. [46]. Because of the object information in TO-CNN, it achieves a recall of 0.90, which is higher than all the other methods by at least 0.14. Figure 5 shows some outputs on MSRA-TD500.

Table 2. Comparison on the MSRA-TD500 dataset.
Table 3. Comparison on the SVT dataset.

Table 3 lists the results of TO-CNN and the state-of-the-art methods on SVT. TO-CNN achieves a precision of 0.95, a recall of 0.75 and an F-score of 0.84. Its precision and recall are higher than those of the other methods by at least 0.27 and 0.12, respectively. Figure 6 shows some detection results of TO-CNN. Comparing the precision, recall and F-scores of the other methods on the two datasets, it can be seen that SVT is the more challenging one; TO-CNN still provides stable performance on SVT.

Fig. 6.

Example detection results from TO-CNN on the SVT benchmark dataset.

Table 4. Comparison on the COCO-Text dataset.

COCO-Text contains 63k images with 173k labeled text regions, mainly English text. In this experiment, TO-CNN is first trained using the object and text labels in NTU-UTOI in the first stage and then using the text labels in COCO-Text in the second and third training stages. TO-CNN provides comparable results in terms of precision, recall and F-score (see Table 4 and Fig. 7). Methods A, B and C, developed by Google, TextSpotter and VGG, achieve 0.36, 0.19 and 0.07, respectively [43]. TO-CNN achieves the highest recall and F-score.

Fig. 7.

Detection results of TO-CNN on the COCO-Text dataset.

Table 5. Comparison on the NTU-UTOI dataset.
Table 6. Faster R-CNN fine-tuned on NTU-UTOI text dataset.

Comparisons on the NTU-UTOI dataset are shown in Table 5 to demonstrate the usefulness of object information in text spotting. Here, TO-CNN is compared with RCNN and faster RCNN, which form the basis of TO-CNN, as well as with the other state-of-the-art methods. For the object-dependency test, TO-CNN is also trained on text labels only (second-to-last row). The experimental results show that without object information, TO-CNN and faster RCNN perform similarly. When trained on images with object labels, TO-CNN significantly outperforms RCNN, faster RCNN and TO-CNN without object information. These results clearly show that objects contain valuable information for text spotting. The precision, recall and F-score after the first, second and third training stages of TO-CNN are {0.59, 0.33, 0.42}, {0.65, 0.53, 0.59} and {0.70, 0.62, 0.66}, respectively. Some visual outputs on the NTU-UTOI dataset are shown in Fig. 8, which includes images taken in different environments and lighting conditions and shows that the proposed algorithm works well in these cases, even for dense text scenes.

Fig. 8.

Detection results of TO-CNN on the NTU-UTOI dataset.

To store object information in the network, the proposed algorithm combines two sub-networks; however, its size is not the largest among the state-of-the-art text spotting networks. To further analyze how object information affects text detection, Fig. 9 shows the percentages of four types of objects containing text and the corresponding recall and precision on the NTU-UTOI testing set. TreePlant and Animals have a negative dependence with text, while CarPlate and SignBoard have a positive dependence. For negatively dependent objects, the precision of TO-CNN is higher than its recall, whereas for positively dependent objects the recall is higher. Note that the positively dependent objects considerably degrade the network trained without object information, which means that text on objects is influenced by the objects themselves. Note also that in Fig. 9, precision and recall are calculated based on the text and the selected object only, showing the dependency between text and the selected objects. That is, if the total number of car plate images in the test set is taken as 100%, then 57% of them overlap with text, leading to precisions of 34% and 41% without and with object information, respectively.

Table 7. Performance of TO-CNN on NTU-UTOI with various anchors.
Fig. 9.

Object dependence and performance analysis of TO-CNN.

Catastrophic forgetting, a common problem in neural networks, is not observed in our study; the experimental results in Table 5 show that the proposed algorithm does not suffer from it. The entry “TO-CNN without object” in Table 5 means that the object labels are removed from the training set while the network depth is kept the same. We also tested two pre-trained faster RCNN models fine-tuned on the NTU-UTOI text data (Table 6): the first was pre-trained on the regular COCO objects and the other was trained on the NTU-UTOI dataset.

Lastly, to show the significance of different scales and aspect ratios of the RPN anchors, we experimented with different anchor parameters on the NTU-UTOI dataset; the results are shown in Table 7. According to these results, adapting the anchor sizes and shapes to text enhances the performance.

5 Conclusion

Traditionally, researchers used only the information in text itself for text spotting in natural scene images, and the objects in these images were neglected, even though objects and text in fact have a strong dependence. In this paper, TO-CNN, with its three training stages, is proposed to exploit object information for text spotting. TO-CNN achieves results comparable to the state-of-the-art methods on COCO-Text, MSRA-TD500 and SVT. The experimental results show that object information is vital for improving text detection accuracy, in particular the recall rate. Currently, TO-CNN uses a linear network architecture. The authors will investigate other network architectures to exploit the object information more effectively and implement cluster-based RPN anchor selection.