Abstract
Scene text detection is a challenging problem due to image clutter and the high variability of text shapes. Many methods have been proposed for multi-oriented and arbitrary-shape text detection, but the storage and computation costs of deep neural networks remain a concern. In this paper, we first introduce Octave Convolution into scene text detection to enlarge the receptive fields and reduce the spatial redundancy of networks. Then we combine Octave Convolution with PSENet, a state-of-the-art arbitrary-shape text detector which predicts kernels of different scales for each text instance. Experimental results on several benchmarks show that the proposed method improves both detection performance and speed in detecting multi-oriented and arbitrary-shape texts. Furthermore, our method achieves state-of-the-art performance on these benchmarks.
This work is supported by the National Natural Science Foundation of China (NSFC) Grants 61721004, 61733007 and 61602004.
1 Introduction
Text detection from natural scene images has attracted intensive attention in recent years, due to its wide potential applications and the multitude of technical challenges. Besides the complex backgrounds and illumination changes that also exist in generic object detection, scene text detection suffers from variable aspect ratios, orientations, and arbitrary shapes. Some examples are shown in Fig. 1. To cope with these challenges, many scene text detectors based on deep neural networks (DNNs) try to improve performance by enlarging the receptive fields. Some methods use a deeper backbone network, but this slows down the inference speed of the detector; the recently proposed PSENet [27] is an example that enlarges the receptive fields by deepening the backbone network. Alternatively, some methods change the way convolution is performed to enlarge the receptive fields. For example, TextBoxes [11] modifies convolutional kernels to obtain larger receptive fields, and some methods [12, 30] incorporate Deformable Convolution [2] to obtain flexible receptive fields. Although these methods enlarge the receptive fields, they also increase the number of parameters.
As a plug-and-play convolutional unit, Octave Convolution [1] aims to reduce the costs of memory and computation while enlarging the receptive fields of the network. Compared with Deformable Convolution, Octave Convolution does not increase the number of parameters. To enlarge the receptive fields and enhance the network's ability to extract features, we adopt Octave Convolution in place of normal convolution in scene text detection, achieving high accuracy and speed on arbitrary-shape text. PSENet separates adjacent text instances by predicting kernels of several scales for each text instance, and recovers the complete text regions with the Progressive Scale Expansion Algorithm. By combining Octave Convolution with PSENet, we achieve high detection performance while speeding up inference. To the best of our knowledge, this is the first work to apply Octave Convolution to scene text detection.
The contributions of this paper are summarized as follows:
(1) To enlarge the receptive fields and enhance the network's ability to extract features, we combine Octave Convolution with PSENet. By utilizing more contextual information, the detection performance can be boosted without increasing memory and computation costs.
(2) Our experiments on several datasets of both curved and multi-oriented texts show that the proposed method achieves state-of-the-art performance and runs faster than PSENet.
2 Related Work
In this section, we first review representative methods of scene text detection, then introduce some works on how to enlarge the receptive fields.
2.1 Scene Text Detection
Scene text detection methods proposed in recent years are mostly based on DNNs. They can be roughly categorized into two groups: regression-based and segmentation-based.
Regression-Based Methods. Regression-based methods are often built on general object detection frameworks, such as Faster R-CNN [23], YOLO [22] and SSD [15]. SegLink [24] adapts SSD to detect multi-oriented scene text by first detecting two locally detectable elements: segments and links. TextBoxes [11] modifies convolutional kernels and anchor boxes to effectively capture various text shapes. RRD [13] is proposed to obtain more accurate bounding boxes. IncepText [30] designs an Inception-Text module to handle text with multiple scales, aspect ratios and orientations.
Different from the above methods, EAST [33], DeepReg [6] and FOTS [16] directly regress the position and size of a text instance from a given point. These methods do not need complex anchor design, but they require receptive fields as large as the text instances themselves and may fail to detect extremely long text.
Segmentation-Based Methods. Segmentation-based methods are mainly inspired by FCN [17] and have an advantage in detecting curved text, but they may fail to separate text instances that are close to each other. PixelLink [3] predicts pixel connections to separate adjacent texts. TextSnake [18] uses a text center line map to separate adjacent text instances. TextField [29] learns a direction field to distinguish adjacent text instances. PSENet [27] generates kernels of different scales for each text instance; with the minimal kernels, close text instances can be distinguished. However, due to the pixel-wise prediction and time-consuming post-processing steps, its speed is quite slow.
2.2 Methods for Enlarging Receptive Fields
Dilated Convolution [31] has been widely used in semantic segmentation and object detection. Its motivation is to increase the receptive fields without additional parameter cost by performing convolution at sparsely sampled locations. Different from this fixed sampling pattern, Deformable Convolution [2] adds 2D offsets to the regular grid sampling locations of the standard convolution; these offsets are learned from the preceding feature maps via additional convolutional layers. Octave Convolution [1] is proposed to reduce the spatial redundancy in CNNs while enlarging the receptive fields. It factorizes the feature maps along the channel dimension into high-frequency and low-frequency maps; the receptive fields are enlarged when convolution is applied on the spatially compressed low-frequency feature maps. To cope with the variable aspect ratios and shapes of scene text, it is intuitive to apply Octave Convolution to scene text detection.
3 Proposed Method
3.1 Overall Pipeline
An overview of our method is illustrated in Fig. 2. The pipeline utilizes a fully convolutional network (FCN) to produce segmentation masks corresponding to different scale kernels of scene text. Octave Convolution is used as a drop-in replacement of normal convolution in the backbone network. Inspired by the Feature Pyramid Network (FPN) [14], low-level and high-level feature maps are combined. As shown in Fig. 2(a), each stage (except the last) of the modified backbone outputs two feature maps: a high-frequency feature map \(X^H\) and a low-frequency feature map \(X^L\), which is half the spatial size of the high-frequency one. Before being used in the FPN, the low-frequency feature map is up-sampled to the size of the high-frequency feature map, and the two maps are then concatenated for each stage (except the last). This is illustrated in Fig. 2(b).
The feature maps from the FPN are further fused into F to facilitate the generation of kernels at various scales. We obtain four 256-channel feature maps (i.e. \(P_2,P_3,P_4,P_5\)) from the backbone. The function \(\mathbf {C}\) is used to get F with 1024 channels:

$$F = \mathbf {C}(P_2, P_3, P_4, P_5) = P_2 \,\Vert \, Up_{2}(P_3) \,\Vert \, Up_{4}(P_4) \,\Vert \, Up_{8}(P_5), \qquad (1)$$

where “||” refers to the concatenation and \(Up_{2}(.)\), \(Up_{4}(.)\), \(Up_{8}(.)\) refer to 2, 4, 8 times up-sampling, respectively. Subsequently, F is fed into a Conv(3, 3)-BN-ReLU layer and reduced to 256 channels. Next, it passes through n Conv(1, 1)-Up-Sigmoid layers and produces n segmentation results \(S_1, S_2, ..., S_n\), whose width and height are the same as those of the input image. Here, BN and Up refer to batch normalization [7] and up-sampling.
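For concreteness, below is a minimal PyTorch sketch of the fusion in Eq. (1) and the prediction head. The bilinear up-sampling mode, the single 7-channel prediction convolution (equivalent to n Conv(1, 1) branches), and all variable names are our assumptions for illustration, not the exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fuse_features(p2, p3, p4, p5):
    # Up-sample P3/P4/P5 by 2/4/8 times to the size of P2, then concatenate
    # along the channel dimension: 4 x 256 = 1024 channels (Eq. (1)).
    size = p2.shape[2:]
    up = lambda p: F.interpolate(p, size=size, mode="bilinear", align_corners=False)
    return torch.cat([p2, up(p3), up(p4), up(p5)], dim=1)

# Head: Conv(3,3)-BN-ReLU reduces F to 256 channels, then a Conv(1,1)-Up-Sigmoid
# stage produces the n segmentation results S_1..S_n (n = 7 in our experiments).
reduce = nn.Sequential(nn.Conv2d(1024, 256, 3, padding=1),
                       nn.BatchNorm2d(256), nn.ReLU(inplace=True))
predict = nn.Conv2d(256, 7, 1)  # one output channel per kernel scale

# Toy usage: P2..P5 at 1/4, 1/8, 1/16, 1/32 of a 640x640 input.
f = fuse_features(*[torch.randn(1, 256, s, s) for s in (160, 80, 40, 20)])
s_maps = torch.sigmoid(F.interpolate(predict(reduce(f)), scale_factor=4))
```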
Each \(S_i\) is a segmentation mask for all text instances at a certain scale, which is decided by the hyper-parameters introduced in Sect. 3.3. Among these masks, \(S_1\) represents the minimal kernel among the predicted kernels, and \(S_n\) denotes the maximal kernel. After obtaining these segmentation masks, the Progressive Scale Expansion Algorithm gradually makes use of them to recover the full size of the text instances.
3.2 Octave Convolution
The advantages of Octave Convolution are enlarging the receptive fields and reducing the costs of memory and computation. The output feature maps of a convolution layer can be factorized along the channel dimension into high-frequency and low-frequency maps, and these advantages are realized by reducing the resolution of the low-frequency maps.
Let \(X\in \mathbb {R}^{c\times h\times w} \) denote the input feature tensor of a convolutional layer, where h and w denote the spatial dimensions and c the number of feature maps or channels. The input feature tensor X can be factorized along the channel dimension into \(X = \{X^H, X^L\}\), where the high-frequency feature maps \(X^H\in \mathbb {R}^{(1-\alpha )c\times h\times w}\) capture fine details and the low-frequency maps \(X^L\in \mathbb {R}^{\alpha c\times \frac{h}{2}\times \frac{w}{2}}\) vary more slowly in the spatial dimensions. Here \(\alpha \in [0,1]\) denotes the ratio of channels allocated to the low-frequency part; the low-frequency feature maps have half the spatial resolution of the high-frequency ones.
Let X, Y be the factorized input and output tensors. The high- and low-frequency feature maps of the output tensor \(Y = \{Y^H, Y^L\}\) are obtained by \(Y^H = Y^{H \rightarrow H} + Y^{L \rightarrow H}\) and \(Y^L = Y^{L \rightarrow L} + Y^{H \rightarrow L}\), respectively, where \(Y^{A \rightarrow B}\) denotes the convolutional update from feature map group A to group B. Specifically, \(Y^{H \rightarrow H}\) and \(Y^{L \rightarrow L}\) denote intra-frequency information update, while \(Y^{H \rightarrow L}\) and \(Y^{L \rightarrow H}\) denote inter-frequency communication. Figure 3 shows the detailed process of Octave Convolution. The output \(Y = \{Y^H, Y^L\}\) can be summarized as follows:

$$Y^H = f(X^H; W^{H \rightarrow H}) + upsample(f(X^L; W^{L \rightarrow H}), 2),$$
$$Y^L = f(X^L; W^{L \rightarrow L}) + f(pool(X^H, 2); W^{H \rightarrow L}), \qquad (2)$$

where f(X; W) denotes a convolution with parameters W, pool(X, k) is an average pooling operation with kernel size \(k \times k\) and stride k, and upsample(X, k) is an up-sampling operation by a factor of k via nearest interpolation.
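As a reference, a minimal PyTorch sketch of a single Octave Convolution layer implementing Eq. (2) is given below; the class name and the 3 × 3 kernel size are illustrative choices rather than the exact implementation of [1].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OctaveConv(nn.Module):
    """Sketch of one Octave Convolution layer (Eq. (2))."""

    def __init__(self, in_ch, out_ch, alpha=0.5, k=3, pad=1):
        super().__init__()
        in_lo, out_lo = int(alpha * in_ch), int(alpha * out_ch)
        in_hi, out_hi = in_ch - in_lo, out_ch - out_lo
        # Intra-frequency updates (H->H, L->L) and
        # inter-frequency communication (H->L, L->H).
        self.h2h = nn.Conv2d(in_hi, out_hi, k, padding=pad)
        self.h2l = nn.Conv2d(in_hi, out_lo, k, padding=pad)
        self.l2l = nn.Conv2d(in_lo, out_lo, k, padding=pad)
        self.l2h = nn.Conv2d(in_lo, out_hi, k, padding=pad)

    def forward(self, x_h, x_l):
        # Y^H = f(X^H; W^{H->H}) + upsample(f(X^L; W^{L->H}), 2)
        y_h = self.h2h(x_h) + F.interpolate(self.l2h(x_l),
                                            scale_factor=2, mode="nearest")
        # Y^L = f(X^L; W^{L->L}) + f(pool(X^H, 2); W^{H->L})
        y_l = self.l2l(x_l) + self.h2l(F.avg_pool2d(x_h, 2))
        return y_h, y_l
```

With \(\alpha = 0.5\), half of the channels are processed at half resolution, which saves memory and computation while the low-frequency path effectively enlarges the receptive field.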
3.3 Label Generation
The ground truths with different kernel scales can be generated by shrinking the original text instance. Subsequently, each shrunk polygon \(p_i\) is converted into a 0/1 binary mask as the segmentation ground truth. These ground truth maps are denoted as \(G_1, G_2, ..., G_n\), respectively. With the scale ratio defined as \(r_i\), the margin \(d_i\) between \(p_n\) and \(p_i\) can be calculated as:

$$d_i = \frac{Area(p_n) \times (1 - r_i^2)}{Perimeter(p_n)}, \qquad (3)$$

where Area(.) is the function computing the polygon area and Perimeter(.) is the function computing the polygon perimeter. The scale ratio \(r_i\) for ground truth map \(G_i\) is defined as:

$$r_i = 1 - \frac{(1-m) \times (n-i)}{n-1}, \qquad (4)$$

where m is the minimal scale ratio, a value in (0, 1]. Based on the definition in Eq. (4), the values of the scale ratios \((i.e., r_1, r_2, ..., r_n)\) are decided by two hyper-parameters n and m, and they increase linearly from m to 1.
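The following sketch illustrates Eqs. (3) and (4); using the pyclipper library (with integer pixel coordinates) for the actual polygon offsetting mirrors common PSENet implementations and is an assumption on our part, as are the function names.

```python
import pyclipper  # polygon offsetting via Vatti clipping

def scale_ratio(i, n, m):
    # Eq. (4): r_i increases linearly from m (i = 1) to 1 (i = n).
    return 1.0 - (1.0 - m) * (n - i) / (n - 1)

def perimeter(poly):
    # Sum of edge lengths of a closed polygon given as (x, y) points.
    return sum(((poly[k][0] - poly[k - 1][0]) ** 2 +
                (poly[k][1] - poly[k - 1][1]) ** 2) ** 0.5
               for k in range(len(poly)))

def shrink_polygon(poly, i, n, m):
    """Shrink the original polygon p_n to obtain p_i."""
    # Eq. (3): margin between the original polygon p_n and the shrunk p_i.
    d_i = abs(pyclipper.Area(poly)) * (1.0 - scale_ratio(i, n, m) ** 2) \
          / perimeter(poly)
    pco = pyclipper.PyclipperOffset()
    pco.AddPath(poly, pyclipper.JT_ROUND, pyclipper.ET_CLOSEDPOLYGON)
    return pco.Execute(-d_i)  # negative offset shrinks the polygon
```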
3.4 Loss Function
The loss function is composed of two terms: the loss \(L_c\) on the complete text instances and the loss \(L_s\) on the shrunk ones. It can be formulated as:

$$L = \lambda L_c + (1-\lambda ) L_s, \qquad (5)$$

where \(\lambda \) balances the importance between \(L_c\) and \(L_s\). Both losses are computed via the dice coefficient. The dice coefficient D is formulated as:

$$D(S_i, G_i) = \frac{2\sum _{x,y}(S_{i,x,y} \times G_{i,x,y})}{\sum _{x,y}S_{i,x,y}^2 + \sum _{x,y}G_{i,x,y}^2}, \qquad (6)$$
where \(S_{i,x,y}\) and \(G_{i,x,y}\) refer to the values of pixel (x, y) in the segmentation result \(S_i\) and the ground truth \(G_i\), respectively. To better distinguish the many patterns similar to text strokes, Online Hard Example Mining (OHEM) [25] is applied to \(L_c\). With the training mask given by OHEM denoted as M, \(L_c\) can be formulated as:

$$L_c = 1 - D(S_n \cdot M, G_n \cdot M). \qquad (7)$$
\(L_s\) is the loss for the shrunk text instances. Since they are enclosed by the original areas of the complete text instances, the pixels of the non-text region in the segmentation result \(S_n\) are ignored to avoid redundancy. Therefore, \(L_s\) can be formulated as follows:

$$L_s = 1 - \frac{\sum _{i=1}^{n-1} D(S_i \cdot W, G_i \cdot W)}{n-1}, \quad W_{x,y} = \begin{cases} 1, & \text {if } S_{n,x,y} \ge 0.5,\\ 0, & \text {otherwise,} \end{cases} \qquad (8)$$
W is a mask which ignores the pixels of the non-text region in \(S_n\), and \(S_{n,x,y}\) refers to the value of pixel (x, y) in \(S_n\).
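A PyTorch sketch of Eqs. (5)–(8) follows; the function names and the elementwise masking by M and W reflect our reading of PSENet [27] and are not verbatim code.

```python
import torch

def dice(s, g, eps=1e-6):
    # Eq. (6): dice coefficient between a predicted map and its ground truth.
    inter = (s * g).sum()
    return 2.0 * inter / (s.pow(2).sum() + g.pow(2).sum() + eps)

def total_loss(S, G, M, lam=0.7):
    """S, G: lists of n maps, with S[-1]/G[-1] the complete text instances.
    M is the OHEM training mask; W keeps only predicted text pixels of S_n."""
    n = len(S)
    W = (S[-1] >= 0.5).float()
    L_c = 1.0 - dice(S[-1] * M, G[-1] * M)                      # Eq. (7)
    L_s = 1.0 - sum(dice(S[i] * W, G[i] * W)
                    for i in range(n - 1)) / (n - 1)            # Eq. (8)
    return lam * L_c + (1.0 - lam) * L_s                        # Eq. (5)
```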
3.5 Inference
During the inference stage, the trained network outputs n segmentation results \(S_1, S_2, ..., S_n\). We use the Progressive Scale Expansion Algorithm [27] to obtain the final detection results. As mentioned before, \(S_1\) contains the minimal kernels, which are used to separate adjacent text instances. The Progressive Scale Expansion Algorithm is an iterative algorithm whose central idea comes from Breadth-First Search (BFS).
In the first step, the kernels in \(S_1\), which correspond to the central parts of the texts, serve as the initial detection results. The subsequent steps progressively expand them to the full size of the scene texts: in the second step, each foreground pixel in \(S_2\) is assigned to the text kernel it connects to and merged into that kernel. Similarly, the pixels in \(S_3, ..., S_n\) are merged to obtain the final detection results. The advantage of this progressive strategy is that the detection results are not influenced by the margin between two close scene texts. Note that there may be conflicting pixels that are hard to assign, which often happens for margin pixels; in practice, such pixels are merged into a single kernel on a first-come-first-served basis.
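A simplified NumPy/SciPy sketch of the expansion procedure is given below; the function names are ours, and real implementations typically run the queue-based expansion in a compiled kernel for speed.

```python
from collections import deque

import numpy as np
from scipy.ndimage import label as connected_components

def progressive_scale_expansion(kernels):
    """kernels: list of binary masks S_1..S_n, smallest kernel first.
    Returns a label map in which each text instance has a unique id."""
    labels, _ = connected_components(kernels[0])  # seeds from minimal kernel
    h, w = labels.shape
    for k in kernels[1:]:
        # BFS from currently labeled pixels into the next, larger kernel.
        queue = deque(zip(*np.nonzero(labels)))
        while queue:
            y, x = queue.popleft()
            for dy, dx in ((0, 1), (0, -1), (1, 0), (-1, 0)):
                ny, nx = y + dy, x + dx
                if (0 <= ny < h and 0 <= nx < w
                        and k[ny, nx] and labels[ny, nx] == 0):
                    labels[ny, nx] = labels[y, x]  # first-come-first-served
                    queue.append((ny, nx))
    return labels
```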
4 Experimental Results
To validate that introducing Octave Convolution is beneficial to both detection performance and speed, we compare our method with other state-of-the-art methods on three challenging public datasets: CTW1500, ICDAR 2015 and ICDAR 2017 MLT.
4.1 Datasets
CTW1500 [32] is a challenging dataset of long curved texts. It consists of 1000 training images and 500 testing images. The text instances are labeled at line level by polygons with 14 points.
ICDAR 2015 (IC15) [8] is a challenging dataset for multi-oriented text detection. It consists of 1000 training images and 500 testing images. The annotations are at word level, using quadrilateral boxes. Many blurred text regions are labeled as “Do Not Care”.
ICDAR 2017 MLT (IC17-MLT) [21] is a large-scale multi-lingual text dataset, which contains 7200 training images, 1800 validation images and 9000 testing images. This dataset is composed of complete scene images in 9 languages. Some languages such as English are labeled at word level, while others such as Chinese are labeled at line level. As in IC15, text regions are annotated by the 4 vertices of quadrilaterals, and hard text instances are labeled as “Do Not Care”.
4.2 Implementation Details
We use ResNet50 [5] with Octave Convolution, pre-trained on ImageNet [4], as our backbone. The networks are optimized with stochastic gradient descent (SGD). We set \(\alpha \) in Octave Convolution to 0.5; the kernel number n and the minimal kernel scale m are set to 7 and 0.4, respectively; and the loss-balance weight \(\lambda \) is set to 0.7.
The same data augmentation as in PSENet is used during training. For IC17-MLT, we train our model only on its own training data. For CTW1500 and IC15, there are two strategies: (1) training from scratch; (2) fine-tuning the model pre-trained on IC17-MLT.
During the test stage, the longer side of the input images is scaled to 1280 on CTW1500, 2240 on IC15, and 3200 on IC17-MLT. The whole algorithm is implemented in PyTorch 1.0, and all experiments are conducted on a regular workstation with an Intel(R) Core(TM) i7-7700K CPU and a GeForce GTX 1080 Ti GPU.
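For reference, the test-time rescaling can be done as in the sketch below (OpenCV-based; the interpolation mode is an assumption):

```python
import cv2

def resize_longer_side(img, target=2240):
    # Scale the image so that its longer side equals `target` pixels,
    # preserving the aspect ratio (e.g. target=1280 for CTW1500).
    h, w = img.shape[:2]
    scale = target / max(h, w)
    return cv2.resize(img, (round(w * scale), round(h * scale)),
                      interpolation=cv2.INTER_LINEAR)
```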
4.3 Comparisons with State-of-the-Art Methods
Detecting Curved Text. To test the effectiveness for curved text detection, we first evaluate our method on CTW1500 and report the single-scale performance in Table 1. On CTW1500, our method surpasses all the counterparts. Without external data, i.e. training from scratch, PSENet achieves an F-measure of 78.0% at 3.9 FPS, while our method achieves 81.57%, which is 3.57% higher. In addition, our method is about 2 FPS faster than PSENet. With external data, our method achieves 83.19% in F-measure, about 1% higher than PSENet. This experiment demonstrates the effectiveness of enlarging the receptive fields for handling curved texts. We show some test examples in Fig. 4(a).
Detecting Oriented Text. We evaluate the proposed model on IC15 to validate its effectiveness for oriented text detection, and compare it with other state-of-the-art methods in Table 2. Without external data, PSENet achieves 80.57% in F-measure, while our method achieves 82.83% and is still faster than PSENet. With external data, our method also outperforms PSENet. Some detection results are shown in Fig. 4(b).
Detecting Multi-Lingual Text. On IC17-MLT, our method achieves 72.22% in F-measure, higher than PSENet with a ResNet152 backbone. In terms of speed, our method is 0.3 FPS faster than PSENet with ResNet152 under the same conditions. This experiment validates that enlarging the receptive fields also improves multi-lingual text detection. Some test examples are shown in Fig. 4(c).
5 Conclusion
In this paper, we introduce Octave Convolution into scene text detection to expand the receptive fields of the network and capture more contextual information. With larger receptive fields, the performance of the detector can be greatly improved. In addition, Octave Convolution reduces both memory and computation costs, which benefits the efficiency of scene text detectors. Experimental results on several datasets validate the effectiveness of our method. In the future, we will explore other ways (e.g., [9]) to expand the receptive fields and further improve the speed of our method.
References
Chen, Y., et al.: Drop an octave: reducing spatial redundancy in convolutional neural networks with octave convolution. arXiv preprint arXiv:1904.05049 (2019)
Dai, J., et al.: Deformable convolutional networks. In: Proceedings of International Conference on Computer Vision (ICCV), pp. 764–773 (2017)
Deng, D., Liu, H., Li, X., Cai, D.: PixelLink: detecting scene text via instance segmentation. In: Proceedings of AAAI Conference on Artificial Intelligence (AAAI) (2018)
Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: Proceedings of Computer Vision and Pattern Recognition (CVPR), pp. 248–255. IEEE (2009)
He, K., Zhang, X., Ren, S., Sun, J.: Identity mappings in deep residual networks. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9908, pp. 630–645. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46493-0_38
He, W., Zhang, X.Y., Yin, F., Liu, C.L.: Deep direct regression for multi-oriented scene text detection. In: Proceedings of International Conference on Computer Vision (ICCV), pp. 745–753 (2017)
Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 (2015)
Karatzas, D., et al.: ICDAR 2015 competition on robust reading. In: Proceedings of International Conference on Document Analysis and Recognition (ICDAR), pp. 1156–1160. IEEE (2015)
Li, Y., Chen, Y., Wang, N., Zhang, Z.: Scale-aware trident networks for object detection. arXiv preprint arXiv:1901.01892 (2019)
Liao, M., Shi, B., Bai, X.: TextBoxes++: a single-shot oriented scene text detector. IEEE Trans. Image Process. 27(8), 3676–3690 (2018)
Liao, M., Shi, B., Bai, X., Wang, X., Liu, W.: TextBoxes: a fast text detector with a single deep neural network. In: Proceedings of AAAI Conference on Artificial Intelligence (AAAI) (2017)
Liao, M., et al.: Scene text recognition from two-dimensional perspective. arXiv preprint arXiv:1809.06508 (2018)
Liao, M., Zhu, Z., Shi, B., Xia, G.S., Bai, X.: Rotation-sensitive regression for oriented scene text detection. In: Proceedings of Computer Vision and Pattern Recognition (CVPR), pp. 5909–5918 (2018)
Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. In: Proceedings of Computer Vision and Pattern Recognition (CVPR), pp. 2117–2125 (2017)
Liu, W., et al.: SSD: single shot multibox detector. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 21–37. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_2
Liu, X., Liang, D., Yan, S., Chen, D., Qiao, Y., Yan, J.: FOTS: fast oriented text spotting with a unified network. In: Proceedings of Computer Vision and Pattern Recognition (CVPR), pp. 5676–5685 (2018)
Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Proceedings of Computer Vision and Pattern Recognition (CVPR), pp. 3431–3440 (2015)
Long, S., Ruan, J., Zhang, W., He, X., Wu, W., Yao, C.: TextSnake: a flexible representation for detecting text of arbitrary shapes. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11206, pp. 19–35. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01216-8_2
Lyu, P., Liao, M., Yao, C., Wu, W., Bai, X.: Mask TextSpotter: an end-to-end trainable neural network for spotting text with arbitrary shapes. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) Computer Vision – ECCV 2018. LNCS, vol. 11218, pp. 71–88. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01264-9_5
Lyu, P., Yao, C., Wu, W., Yan, S., Bai, X.: Multi-oriented scene text detection via corner localization and region segmentation. In: Proceedings of Computer Vision and Pattern Recognition (CVPR), pp. 7553–7563 (2018)
Nayef, N., et al.: ICDAR 2017 robust reading challenge on multi-lingual scene text detection and script identification-RRC-MLT. In: Proceedings of International Conference on Document Analysis and Recognition (ICDAR), vol. 1, pp. 1454–1459. IEEE (2017)
Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: unified, real-time object detection. In: Proceedings of Computer Vision and Pattern Recognition (CVPR), pp. 779–788 (2016)
Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: Proceedings of Neural Information Processing Systems (NeurIPS), pp. 91–99 (2015)
Shi, B., Bai, X., Belongie, S.: Detecting oriented text in natural images by linking segments. In: Proceedings of Computer Vision and Pattern Recognition (CVPR), pp. 2550–2558 (2017)
Shrivastava, A., Gupta, A., Girshick, R.: Training region-based object detectors with online hard example mining. In: Proceedings of Computer Vision and Pattern Recognition (CVPR), pp. 761–769 (2016)
Tian, Z., Huang, W., He, T., He, P., Qiao, Y.: Detecting text in natural image with connectionist text proposal network. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9912, pp. 56–72. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46484-8_4
Wang, W., et al.: Shape robust text detection with progressive scale expansion network. In: Proceedings of Computer Vision and Pattern Recognition (CVPR) (2019)
Wang, X., Jiang, Y., Luo, Z., Liu, C.L., Choi, H., Kim, S.: Arbitrary shape scene text detection with adaptive text region representation. In: Proceedings of Computer Vision and Pattern Recognition (CVPR), pp. 6449–6458 (2019)
Xu, Y., Wang, Y., Zhou, W., Wang, Y., Yang, Z., Bai, X.: TextField: learning a deep direction field for irregular scene text detection. IEEE Trans. Image Process. 28, 5566–5579 (2019)
Yang, Q., et al.: IncepText: a new inception-text module with deformable PSROI pooling for multi-oriented scene text detection. arXiv preprint arXiv:1805.01167 (2018)
Yu, F., Koltun, V.: Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122 (2015)
Yuliang, L., Lianwen, J., Shuaitao, Z., Sheng, Z.: Detecting curve text in the wild: new dataset and new solution. arXiv preprint arXiv:1712.02170 (2017)
Zhou, X., et al.: EAST: an efficient and accurate scene text detector. In: Proceedings of Computer Vision and Pattern Recognition (CVPR), pp. 5551–5560 (2017)