Abstract
Scene text detection in the wild is an active research area in computer vision and has achieved great progress with the aid of deep learning. However, training deep text detection models requires large amounts of annotations such as bounding boxes and quadrangles, which are laborious and expensive to obtain. Although synthetic data is easier to acquire, models trained on it exhibit a large performance gap compared with those trained on real data because of domain shift. To address this problem, we propose a novel two-stage framework for cost-efficient scene text detection. Specifically, to unleash the power of synthetic data, we design an unsupervised domain adaptation scheme consisting of Entropy-aware Global Transfer (EGT) and Text Region Transfer (TRT) to pre-train the model. Furthermore, we fine-tune the model with a minimal set of actively annotated real samples together with enhanced pseudo-labeled ones, aiming to save annotation cost. In this framework, both the diversity of the synthetic data and the realism of the unlabeled real data are fully exploited. Extensive experiments on various benchmarks show that the proposed framework significantly outperforms the baseline and achieves desirable performance with only a few labeled real samples.
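To make the active-annotation idea in the fine-tuning stage concrete, the sketch below illustrates one common selection criterion, entropy-based uncertainty sampling: unlabeled images whose predicted text probabilities are least confident are ranked first for manual labeling. This is a minimal, self-contained illustration under assumed inputs (toy per-pixel text-probability maps); the function names and data are hypothetical and do not reproduce the paper's actual EGT/TRT components or annotation pipeline.

```python
import math

def binary_entropy(p):
    """Entropy (natural log) of a text/non-text probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))

def select_uncertain(prob_maps, k):
    """Rank unlabeled images by mean prediction entropy and return the
    indices of the top-k most uncertain ones to send for annotation."""
    ranked = sorted(
        range(len(prob_maps)),
        key=lambda i: sum(binary_entropy(p) for p in prob_maps[i]) / len(prob_maps[i]),
        reverse=True,
    )
    return ranked[:k]

# Toy text-probability maps for three unlabeled images.
maps = [
    [0.01, 0.99, 0.02],  # confident predictions -> low entropy
    [0.50, 0.45, 0.55],  # ambiguous predictions -> high entropy
    [0.90, 0.10, 0.85],
]
print(select_uncertain(maps, 1))  # -> [1]
```

Confident images (probabilities near 0 or 1) contribute little entropy, so the annotation budget is spent on genuinely ambiguous samples; the remaining confident predictions can instead serve as pseudo labels.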
Acknowledgments
This work is supported by the Open Research Project of the State Key Laboratory of Media Convergence and Communication, Communication University of China, China (No. SKLMCC2020KF004), the Beijing Municipal Science & Technology Commission (Z191100007119002), the Key Research Program of Frontier Sciences, CAS, Grant NO ZDBS-LY-7024, and the National Natural Science Foundation of China (No. 62006221).
Copyright information
© 2021 Springer Nature Switzerland AG
Cite this paper
Zeng, G., Zhang, Y., Zhou, Y., Yang, X. (2021). A Cost-Efficient Framework for Scene Text Detection in the Wild. In: Pham, D.N., Theeramunkong, T., Governatori, G., Liu, F. (eds) PRICAI 2021: Trends in Artificial Intelligence. PRICAI 2021. Lecture Notes in Computer Science(), vol 13031. Springer, Cham. https://doi.org/10.1007/978-3-030-89188-6_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-89187-9
Online ISBN: 978-3-030-89188-6
eBook Packages: Computer Science (R0)