Abstract
Video text detection, tracking, and recognition in natural scenes have recently become popular topics in the computer vision community. However, most existing algorithms and benchmarks focus on common text cases (e.g., normal size and density) in a single scenario, while ignoring extreme video text challenges, i.e., dense and small text in various scenarios. In this competition report, we establish a video text reading benchmark, named DSText, which focuses on the challenge of reading dense and small text in videos across various scenarios. Compared with previous datasets, the proposed dataset mainly introduces three new challenges: 1) dense video texts, a new challenge for video text spotters; 2) a high proportion of small texts; and 3) various new scenarios, e.g., ‘Game’ and ‘Sports’. DSText includes 100 video clips from 12 open scenarios and supports two tasks: video text tracking (Task 1) and end-to-end video text spotting (Task 2). During the competition period (opened on 15 February 2023 and closed on 20 March 2023), a total of 24 teams participated in the proposed tasks, with around 30 valid submissions. In this article, we describe the dataset statistics, tasks, evaluation protocols, and result summaries of the ICDAR 2023 DSText competition. We hope the benchmark will promote video text research in the community.
Acknowledgements
This competition is supported by the National Natural Science Foundation (NSFC#62225603).
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Wu, W. et al. (2023). ICDAR 2023 Competition on Video Text Reading for Dense and Small Text. In: Fink, G.A., Jain, R., Kise, K., Zanibbi, R. (eds) Document Analysis and Recognition - ICDAR 2023. ICDAR 2023. Lecture Notes in Computer Science, vol 14188. Springer, Cham. https://doi.org/10.1007/978-3-031-41679-8_23
DOI: https://doi.org/10.1007/978-3-031-41679-8_23
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-41678-1
Online ISBN: 978-3-031-41679-8
eBook Packages: Computer Science (R0)