Abstract
We propose a new Transformer-based text detection model, named Dynamic Queries enhanced DEtection TRansformer (DQ-DETR), to detect arbitrary shape text instances from images with high localization accuracy. Unlike previous Transformer-based methods, which take all control points on the boundaries/center-lines of all text instances as the queries of each Transformer decoder layer, we extend the query set for each decoder layer gradually, allowing DQ-DETR to achieve higher localization accuracy by detecting the control points of each text instance progressively. Specifically, after refining the positions of the existing control points from the preceding decoder layer, each decoder layer appends a new reference point on each side of each center-line segment; these points are input to the next decoder layer as additional queries for detecting new control points. Because the offsets from the new control points to the added reference points are small, their positions can be predicted more precisely, leading to higher center-line detection accuracy. Consequently, our DQ-DETR achieves state-of-the-art performance on five public text detection benchmarks: MLT2017, Total-Text, CTW1500, ArT and DAST1500.
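The progressive query extension described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how the new reference points are placed, so the sketch assumes one reading, in which each layer adds one linearly extrapolated point at each end of the current center-line. The names `extend_centerline`, `progressive_queries` and the `refine` callback (a stand-in for a decoder layer that refines point positions) are hypothetical.

```python
from typing import Callable, List, Tuple

Point = Tuple[float, float]


def extend_centerline(points: List[Point]) -> List[Point]:
    """Add one reference point at each end of an ordered center-line.

    Assumption: new points are placed by linear extrapolation from the
    two nearest existing control points. The abstract only states that
    the offsets from these reference points to the true control points
    are small; the actual placement rule may differ.
    """
    if len(points) < 2:
        raise ValueError("need at least two control points")
    (x0, y0), (x1, y1) = points[0], points[1]
    (xm, ym), (xn, yn) = points[-2], points[-1]
    head = (2 * x0 - x1, 2 * y0 - y1)  # mirror of points[1] about points[0]
    tail = (2 * xn - xm, 2 * yn - ym)  # mirror of points[-2] about points[-1]
    return [head] + list(points) + [tail]


def progressive_queries(
    initial: List[Point],
    num_layers: int,
    refine: Callable[[List[Point]], List[Point]],
) -> List[Point]:
    """Refine the current points in every layer, then grow the query set.

    The set is extended only between layers, so every point returned has
    been refined by at least one layer (an assumption of this sketch).
    """
    points = list(initial)
    for layer in range(num_layers):
        points = refine(points)  # hypothetical decoder-layer refinement
        if layer < num_layers - 1:
            points = extend_centerline(points)
    return points


if __name__ == "__main__":
    # Identity refinement keeps the example deterministic.
    result = progressive_queries([(0.0, 0.0), (1.0, 0.0)], 3, lambda p: p)
    print(len(result), result)
```

Under this reading the number of queries per text instance grows by two per layer, and each newly added query starts close to its target, which is the property the abstract credits for the more precise offset prediction.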
This work was done when Jiawei Wang was an intern in MMI Group, Microsoft Research Asia, Beijing, China.
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Ma, C., Sun, L., Wang, J., Huo, Q. (2023). DQ-DETR: Dynamic Queries Enhanced Detection Transformer for Arbitrary Shape Text Detection. In: Fink, G.A., Jain, R., Kise, K., Zanibbi, R. (eds) Document Analysis and Recognition - ICDAR 2023. ICDAR 2023. Lecture Notes in Computer Science, vol 14188. Springer, Cham. https://doi.org/10.1007/978-3-031-41679-8_14
DOI: https://doi.org/10.1007/978-3-031-41679-8_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-41678-1
Online ISBN: 978-3-031-41679-8
eBook Packages: Computer Science (R0)