An end to end system for subtitle text extraction from movie videos

Elshahaby, Hossam; Rashwan, Mohsen

doi:10.1007/s12652-021-02951-1

An end to end system for subtitle text extraction from movie videos

Original Research
Published: 22 February 2021

Volume 13, pages 1853–1865, (2022)
Cite this article

Journal of Ambient Intelligence and Humanized Computing Aims and scope Submit manuscript

416 Accesses
5 Citations
Explore all metrics

Abstract

A new technique for text detection inside a complex graphical background, its extraction, and enhancement to be easily recognized using the optical character recognition (OCR). The technique uses a deep neural network for feature extraction and classifying the text as containing text or not. An error handling and correction (EHC) technique is used to resolve classification errors. A multiple frame integration (MFI) algorithm is introduced to extract the graphical text from its background. Text enhancement is done by adjusting the contrast, minimize noise, and increasing the pixels resolution. A standalone software Component-Off-The-Shelf (COTS) is used to recognize the text characters and qualify the system performance. Generalization for multilingual text is done with the proposed solution. A newly created dataset containing videos with different languages is collected for this purpose to be used as a benchmark. A new HMVGG16 convolutional neural network (CNN) is used for frame classification as text containing or non-text containing, has accuracy equals to 98%. The introduced system weighted average caption extraction accuracy equals to 96.15%. The correctly detected characters (CDC) average recognition accuracy using the Abbyy SDK OCR engine equals 97.75%.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

A system for detection of moving caption text in videos: a news use case

Article 22 April 2021

Augmented Scene Text Recognition Using Crosswise Feature Extraction

Article 15 September 2021

Detection and Tracking of Text from Video Using MSER and SIFT

References

Alves W, Hashimoto R (2010) Text regions extracted from scene images by ultimate attribute opening and decision tree classification. In: Proceedings of the 23rd Sibgrapi conference on graphics, patterns, and images
Audithan S, Chandrasekaran RM (2009) Document text extraction from document images using Haar discrete wavelet transform. Eur J Sci Res 36(04):502–512
Google Scholar
Cho H, Sung M, Jun B (2016) Canny text detector: fast and robust scene text localization algorithm. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3566–3573
Dai J, Li Y, He K, Sun J (2016) R-FCN: object detection via region-based fully convolutional networks. In: advances in neural information processing systems, pp 379–387
Gidaris S, Komodakis N (2015) Object detection via a multi-region and semantic segmentation-aware CNN model. In: Proceedings of the IEEE international conference on computer vision, pp 1134–1142
Gomez L, Karatzas D (2017) Text proposals: a text specific selective search algorithm for word spotting in the wild. Pattern Recogn 70:60–74
Article Google Scholar
Gorinski P, Lapata M (2018) What’s this movie about? A joint neural network architecture for movie content analysis. In: University of Edinburgh, Proceedings of NAACL-HLT, pp 1770–1781
Grover S, Arora K, Mitra S (2009) Text extraction from document images using edge information. In: IEEE India Council Conference
Gupta A, Vedaldi A, Zisserman A (2016) Synthetic data for text localization in natural images. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2315–2324
Haq I, Muhammad K, Hussain T, Kwon S, Sodanil M, Baik S, Lee M (2019) Movie scene segmentation using object detection and set theory. Int J Distrib Sens Netw 15(6)
He K, Zhang X, Ren S, Sun J (2016a) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778
He T, Huang W, Qiao Y, Yao J (2016b) Text attentional convolutional neural network for scene text detection. IEEE Trans Image Process 25(6):2529–2541
Article MathSciNet Google Scholar
He P, Huang W, He T, Zhu Q, Qiao Y, Li X (2017) Single shot text detector with regional attention. In: Computer vision and pattern recognition, Cornell University, arXiv:1709.00138
Hesham M, Hani B, Fouad N, Amer E (2018) Smart trailer: automatic generation of movie trailer using only subtitles. In: First international workshop on deep and representation learning (IWDRL), IEEE, pp 26–30
Hoang T, Tabbone S (2010) Text extraction from graphical document images using sparse representation. In: Proceedings of the 9th IAPR international workshop on document analysis systems, pp 143–150
https://pixabay.com/vectors/bitcoin-money-cryptocurrency-4851383/. Accessed 28 Sept 2020
https://www.dreamstime.com/photos-images/autonomous-car.html. Accessed 28 Sept 2020
https://www.freepik.com/premium-photo/engineer-check-control-welding-robotics-automatic-arms-machine_5284742.htm. Accessed 28 Sept 2020
https://www.robots.ox.ac.uk/~vgg/software/textspot/. Accessed 10 June 2020
Huang W, Qiao Y, Tang X (2014) Robust scene text detection with convolution neural network induced MSER trees. In: European conference on computer vision, Springer, Zurich, pp 497–511
Indermühle E, Liwicki M, Bunke H (2010) IAMonDo-database: an online handwritten document database with non-uniform contents. In: Proceedings of the 9th IAPR international workshop on document analysis systems (DAS ’10), pp 97–104
Jaderberg M, Simonyan K, Vedaldi A, Zisserman A (2016) Reading text in the wild with convolutional neural networks. Int J Comput Vis 116(1):1–20
Article MathSciNet Google Scholar
Jung K, Kim E (2004) Automatic text extraction for content-based image indexing. In: Proceedings of PAKDD, pp 497–507
Kong T, Yao A, Chen Y, Sun F (2016) Hypernet: towards accurate region proposal generation and joint object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 845–853
Liao M, Shi B, Bai X, Wang X, Liu W (2017) Textboxes: a fast text detector with a single deep neural network. In: AAAI, pp 4161–4167
Liu X, Samarabandu J (2006) Multiscale edge-based text extraction from complex images. In: Proceedings of the international conference of multimedia and Expo, pp 1721–1724
Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3431–3440
Lu Q, Wang Y (2019) Automatic text location of multimedia video for subtitle frame. J Ambient Intell Humaniz Comput
Moradi M, Mozaffari S, Orouji A (2010) Farsi/Arabic text extraction from video images by corner detection. In: 2010 6th Iranian conference on machine vision and image processing, pp 1–6
Nagabhushan P, Nirmala S (2009) Text extraction in complex color document images for enhanced readability. Intell Inf Manag 2:120–133
Google Scholar
Neumann L, Matas J (2012) Real-time scene text localization and recognition. In: Computer vision and pattern recognition (CVPR) IEEE conference, pp 3538–3545
Noh H, Hong S, Han B (2015) Learning deconvolution network for semantic segmentation. In: Proceedings of the IEEE international conference on computer vision, Santiago: IEEE Computer Society, pp 1520–1528
Redmon J, Divvala S, Girshick R, Farhadi A (2016) You only look once: unified, real-time object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 779–788
Ren S, He K, Girshick R, Sun J (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in neural information processing systems, pp 91–99
Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M (2015) Imagenet large scale visual recognition challenge. Int J Comput Vis 115(3):211–252
Article MathSciNet Google Scholar
Shelhamer E, Long J, Darrell T (2017) Fully convolutional networks for semantic segmentation. IEEE Trans Pattern Anal Mach Intell 39(4):640–651
Article Google Scholar
Shi J, Tomasi C (1994) Good features to track. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 593–600
Shivakumara P, Dutta A, Pal U, Tan C (2010) A new method for handwritten scene text detection in video. In: International conference on frontiers in handwriting recognition, pp 16–18
Shrivastava A, Gupta A, Girshick R (2016) Training region-based object detectors with online hard example mining. In: Proceedings of the IEEE conference on computer vision and pattern recognition, Las Vegas: IEEE Computer Society, arXiv:1604.03540
Sun L, Huo Q, Jia W, Chen K (2015) A robust approach for text detection from natural scene images. Pattern Recogn 48(9):2906–2920
Article Google Scholar
Tian S, Pan Y, Huang C, Lu S, Yu K, Tan C (2015) Text flow: a unified text detection system in natural scene images. In: Proceedings of the IEEE international conference on computer vision, pp 4651–4659
Tian Z, Huang W, He T, He P, Qiao Y (2016) Detecting text in natural image with connectionist text proposal network. In: European conference on computer vision, pp 56–72
Vijayakumar V, Nedunchezhianm R (2011) A novel method for super imposed text extraction in a sports video. Int J Comput Appl 15(1):1
Google Scholar
Xiang D, Yan H, Chen X, Cheng Y (2010) Offline Arabic handwriting recognition system based on HMM. In: 2010 3rd International conference on computer science and information technology
Yang C, Pei W, Wu L, Yin X (2018) Chinese text-line detection from web videos with fully convolutional networks. Big Data Anal 3(2):1
Google Scholar
Ye Q, Doermann D (2015) Text detection recognition in imagery: a survey. IEEE Trans Pattern Anal Mach Intell 37(7):1480–1500
Article Google Scholar
Yin XC, Pei WY, Zhang J, Hao H (2015) Multi-orientation scene text detection with adaptive clustering. IEEE Trans Pattern Anal Mach Intell 37(9):1930–1937
Article Google Scholar
Zamberletti A, Noce L, Gallo I (2014) Text localization based on fast feature pyramids and multi-resolution maximally stable extremal regions. In: Asian conference on computer vision, pp 91–105
Zhang Z, Shen W, Yao C, Bai X (2015) Symmetry based text line detection in natural scenes. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2558–2567
Zhang Z, Zhang C, Shen W, Yao C, Liu W, Bai X (2016) Multi-oriented text detection with fully convolutional networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition. Las Vegas: IEEE Computer Society, pp 4159–4167
Zhang S, Liu Y, Jin L, Luo C (2018) Feature enhancement network: a refined scene text detector. In: Thirty-second AAAI conference on artificial intelligence (AAAI-18), pp 2612–2619
Zhong Z, Jin L, Zhang S, Feng Z (2016) DeepText: a unified framework for text proposal generation and text detection in natural images. In: Computer vision and pattern recognition, Cornell University, arXiv:1605.07314
Zhou X, Yao C, Wen H, Wang Y, Zhou S, He W, Liang J (2017) EAST: an efficient and accurate scene text detector. In: Computer vision and pattern recognition, Cornell University, arXiv:1704.03155
Zhu Y, Yao C, Bai X (2016) Scene text detection and recognition: recent advances and future trends. Front Comput Sci 10(1):19–36
Article Google Scholar

Download references

Acknowledgements

I would like to thank God for his help. Special thanks to the RDI team, Dr. Sven Dickinson, and my faculty department members for supporting me with their experience and data set used in my research.

Author information

Authors and Affiliations

Cairo University, Giza, Egypt
Hossam Elshahaby & Mohsen Rashwan

Authors

Hossam Elshahaby
View author publications
You can also search for this author in PubMed Google Scholar
Mohsen Rashwan
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Hossam Elshahaby.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Elshahaby, H., Rashwan, M. An end to end system for subtitle text extraction from movie videos. J Ambient Intell Human Comput 13, 1853–1865 (2022). https://doi.org/10.1007/s12652-021-02951-1

Download citation

Received: 30 April 2020
Accepted: 04 February 2021
Published: 22 February 2021
Issue Date: April 2022
DOI: https://doi.org/10.1007/s12652-021-02951-1

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

An end to end system for subtitle text extraction from movie videos

Abstract

Access this article

Similar content being viewed by others

A system for detection of moving caption text in videos: a news use case

Augmented Scene Text Recognition Using Crosswise Feature Extraction

Detection and Tracking of Text from Video Using MSER and SIFT

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Keywords

Navigation

An end to end system for subtitle text extraction from movie videos

Abstract

Access this article

Similar content being viewed by others

A system for detection of moving caption text in videos: a news use case

Augmented Scene Text Recognition Using Crosswise Feature Extraction

Detection and Tracking of Text from Video Using MSER and SIFT

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation