Spatiotemporal text localization for videos

Cai, Yuanqiang; Wang, Weiqiang; Huang, Shao; Ma, Jin; Lu, Ke

doi:10.1007/s11042-018-6081-7

Published: 07 June 2018

Volume 77, pages 29323–29345, (2018)

Multimedia Tools and Applications

Yuanqiang Cai¹,
Weiqiang Wang¹,
Shao Huang¹,
Jin Ma¹ &
Ke Lu²


Abstract

Text in videos contains rich semantic information, which is useful for content-based video understanding and retrieval. Although a great number of state-of-the-art methods have been proposed to detect text in images and videos, few works focus on spatiotemporal text localization in videos. In this paper, we present a spatiotemporal text localization method with improved detection efficiency and performance. Concretely, we propose a unified framework that consists of the sampling-and-recovery model (SaRM) and the divide-and-conquer model (DaCM). SaRM exploits the temporal redundancy of text to increase detection efficiency for videos. DaCM is designed to efficiently localize text in the spatial and temporal domains simultaneously. In addition, we construct a challenging video overlaid text dataset named UCAS-STLData, which contains 57,070 frames with spatiotemporal ground truths. In the experiments, we comprehensively evaluate the proposed method on publicly available overlaid text datasets and on UCAS-STLData. Compared with state-of-the-art methods for spatiotemporal text localization, the proposed method achieves a slight improvement in performance and a significant improvement in efficiency.
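The abstract does not spell out how SaRM samples frames or recovers the skipped ones, so the following is only an illustrative sketch of the general idea of exploiting temporal redundancy, not the authors' algorithm. It assumes a hypothetical per-frame detector `has_text`, a hypothetical stride `step`, and that text stays on screen for at least `step` frames. The detector runs on a sparse grid of frames, and a binary search recovers the exact start and end frames only where neighbouring samples disagree. Spatial localization is left out.

```python
from typing import Callable, List, Tuple


def localize_intervals(
    num_frames: int,
    has_text: Callable[[int], bool],  # hypothetical per-frame detector (assumption)
    step: int = 25,                   # assumed stride: about 1 s at 25 fps
) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) frame intervals where has_text is True."""
    if num_frames <= 0:
        return []

    # Sampling: run the detector on a sparse grid plus the last frame.
    samples = list(range(0, num_frames, step))
    if samples[-1] != num_frames - 1:
        samples.append(num_frames - 1)
    labels = {i: has_text(i) for i in samples}

    def boundary(lo: int, hi: int) -> int:
        """Binary-search the first frame in (lo, hi] labelled like frame hi."""
        target = labels[hi]
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if has_text(mid) == target:
                hi = mid
            else:
                lo = mid
        return hi

    # Recovery: refine only the gaps where neighbouring samples disagree.
    intervals: List[Tuple[int, int]] = []
    start = 0 if labels[0] else None
    for lo, hi in zip(samples, samples[1:]):
        if labels[lo] == labels[hi]:
            continue  # assumed redundant: no change inside this gap
        change = boundary(lo, hi)
        if labels[hi]:
            start = change                      # text appears at `change`
        else:
            intervals.append((start, change - 1))  # text ends just before `change`
            start = None
    if start is not None:
        intervals.append((start, num_frames - 1))
    return intervals
```

Under these assumptions the detector is called roughly `num_frames / step` times plus a logarithmic number of calls per text boundary, instead of once per frame. Text events shorter than `step` frames can be missed, which is the price of the sparse sampling.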

Notes

The dataset will be made publicly available to researchers soon.
In general, videos are played at 25 frames per second.

Acknowledgments

This work is supported by the National Key R&D Program of China under contract No. 2017YFB1002203, and by the National Natural Science Foundation of China (NSFC) under Grant No. 61772495.

Author information

Authors and Affiliations

School of Computer and Control Engineering, University of Chinese Academy of Sciences, Beijing, 101408, China
Yuanqiang Cai, Weiqiang Wang, Shao Huang & Jin Ma
School of Engineering Science, University of Chinese Academy of Sciences, Beijing, 100049, China
Ke Lu

Corresponding author

Correspondence to Weiqiang Wang.

About this article

Cite this article

Cai, Y., Wang, W., Huang, S. et al. Spatiotemporal text localization for videos. Multimed Tools Appl 77, 29323–29345 (2018). https://doi.org/10.1007/s11042-018-6081-7

Received: 13 December 2017
Revised: 25 April 2018
Accepted: 30 April 2018
Published: 07 June 2018
Issue Date: November 2018
DOI: https://doi.org/10.1007/s11042-018-6081-7
