
Spatiotemporal text localization for videos

Multimedia Tools and Applications

Abstract

Text in videos carries rich semantic information that is useful for content-based video understanding and retrieval. Although many state-of-the-art methods have been proposed to detect text in images and videos, few works focus on spatiotemporal text localization in videos. In this paper, we present a spatiotemporal text localization method with improved detection efficiency and performance. Concretely, we propose a unified framework that consists of the sampling-and-recovery model (SaRM) and the divide-and-conquer model (DaCM). SaRM exploits the temporal redundancy of text to increase detection efficiency for videos. DaCM is designed to efficiently localize text in the spatial and temporal domains simultaneously. In addition, we construct a challenging video overlaid text dataset named UCAS-STLData, which contains 57070 frames with spatiotemporal ground truths. In the experiments, we comprehensively evaluate the proposed method on publicly available overlaid text datasets and on UCAS-STLData. Compared with state-of-the-art methods for spatiotemporal text localization, the proposed method achieves a slight performance improvement together with a significant efficiency improvement.
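The abstract does not describe how SaRM works internally, so the following is only a minimal sketch of the general idea of exploiting temporal redundancy: run a detector on sparsely sampled frames, copy the result across any interval whose two endpoints agree, and bisect intervals whose endpoints differ. It is not the authors' model. The names `localize_sparse`, `detect`, and `stride`, the default stride of 25 frames, and the use of exact box equality as the agreement test are all assumptions made for illustration.

```python
from typing import Callable, Dict, FrozenSet, Tuple

Box = Tuple[int, int, int, int]             # (x, y, width, height), hypothetical layout
Detector = Callable[[int], FrozenSet[Box]]  # frame index -> text boxes in that frame


def localize_sparse(num_frames: int, detect: Detector,
                    stride: int = 25) -> Dict[int, FrozenSet[Box]]:
    """Label every frame while calling `detect` on only some of them.

    Illustrative sampling-and-recovery scheme (an assumption, not the
    paper's SaRM): sample anchors every `stride` frames, then recover
    the frames in between by copying or by bisection.
    """
    results: Dict[int, FrozenSet[Box]] = {}
    if num_frames <= 0:
        return results

    def probe(i: int) -> FrozenSet[Box]:
        # Run the detector at most once per frame.
        if i not in results:
            results[i] = detect(i)
        return results[i]

    def fill(lo: int, hi: int) -> None:
        # Both endpoints are already probed when fill is called.
        if hi - lo <= 1:
            return
        if probe(lo) == probe(hi):
            # Endpoints agree: assume the text is unchanged in between.
            for i in range(lo + 1, hi):
                results[i] = results[lo]
            return
        # Endpoints differ: bisect to find where the text changes.
        mid = (lo + hi) // 2
        probe(mid)
        fill(lo, mid)
        fill(mid, hi)

    anchors = list(range(0, num_frames, stride))
    if anchors[-1] != num_frames - 1:
        anchors.append(num_frames - 1)
    for a in anchors:
        probe(a)
    for lo, hi in zip(anchors, anchors[1:]):
        fill(lo, hi)
    return results
```

A scheme like this misses text that appears and disappears entirely between two sampled frames, and a real detector would need a tolerance-based comparison (for example, box overlap) rather than exact equality, since detected boxes usually jitter from frame to frame.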

Notes

  1. The dataset will be publicly available soon for researchers.

  2. In general, videos are played at 25 frames per second.

Acknowledgments

This work is supported by the National Key R&D Program of China under contract No. 2017YFB1002203, and by the National Natural Science Foundation of China (NSFC) under Grant No. 61772495.

Author information

Corresponding author

Correspondence to Weiqiang Wang.

About this article

Cite this article

Cai, Y., Wang, W., Huang, S. et al. Spatiotemporal text localization for videos. Multimed Tools Appl 77, 29323–29345 (2018). https://doi.org/10.1007/s11042-018-6081-7

  • DOI: https://doi.org/10.1007/s11042-018-6081-7
