ICDAR 2023 Competition on Video Text Reading for Dense and Small Text

Wu, Weijia; Zhao, Yuzhong; Li, Zhuang; Li, Jiahong; Shou, Mike Zheng; Pal, Umapada; Karatzas, Dimosthenis; Bai, Xiang

doi:10.1007/978-3-031-41679-8_23

Weijia Wu ORCID: orcid.org/0000-0001-6011-5174

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14188)

Included in the following conference series:

International Conference on Document Analysis and Recognition

Abstract

Video text detection, tracking, and recognition in natural scenes have recently become very popular in the computer vision community. However, most existing algorithms and benchmarks focus on common text cases (e.g., normal size and density) and a single scenario, while ignoring extreme video text challenges, i.e., dense and small text in various scenarios. In this competition report, we establish a video text reading benchmark, named DSText, which focuses on the challenge of reading dense and small text in videos across various scenarios. Compared with previous datasets, the proposed dataset mainly includes three new challenges: 1) dense video texts, a new challenge for video text spotters; 2) a high proportion of small texts; 3) various new scenarios, e.g., ‘Game’, ‘Sports’, etc. DSText includes 100 video clips from 12 open scenarios and supports two tasks: video text tracking (Task 1) and end-to-end video text spotting (Task 2). During the competition period (opened on 15 February 2023 and closed on 20 March 2023), a total of 24 teams participated in the proposed tasks, with around 30 valid submissions. In this article, we describe detailed statistics of the dataset, the tasks, the evaluation protocols, and result summaries of the ICDAR 2023 DSText competition. We hope the benchmark will promote video text research in the community.

Acknowledgements

This competition is supported by the National Natural Science Foundation of China (NSFC, grant no. 62225603).

Author information

Authors and Affiliations

Zhejiang University, Hangzhou, China
Weijia Wu
University of Chinese Academy of Sciences, Beijing, China
Yuzhong Zhao
Kuaishou Technology, Beijing, China
Zhuang Li & Jiahong Li
National University of Singapore, Singapore, Singapore
Mike Zheng Shou
Computer Vision and Pattern Recognition Unit, Indian Statistical Institute, Chennai, India
Umapada Pal
Computer Vision Centre, Universitat Autònoma de Barcelona, Barcelona, Spain
Dimosthenis Karatzas
Huazhong University of Science and Technology, Wuhan, China
Xiang Bai

Corresponding author

Correspondence to Weijia Wu.

Editor information

Editors and Affiliations

TU Dortmund University, Dortmund, Germany
Gernot A. Fink
Adobe, College Park, MN, USA
Rajiv Jain
Osaka Metropolitan University, Osaka, Japan
Koichi Kise
Rochester Institute of Technology, Rochester, NY, USA
Richard Zanibbi

About this paper

Cite this paper

Wu, W. et al. (2023). ICDAR 2023 Competition on Video Text Reading for Dense and Small Text. In: Fink, G.A., Jain, R., Kise, K., Zanibbi, R. (eds) Document Analysis and Recognition - ICDAR 2023. ICDAR 2023. Lecture Notes in Computer Science, vol 14188. Springer, Cham. https://doi.org/10.1007/978-3-031-41679-8_23

DOI: https://doi.org/10.1007/978-3-031-41679-8_23
Published: 19 August 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-41678-1
Online ISBN: 978-3-031-41679-8
eBook Packages: Computer Science, Computer Science (R0)
