An end to end system for subtitle text extraction from movie videos

  • Original Research
Journal of Ambient Intelligence and Humanized Computing

Abstract

This paper presents a new technique for detecting text inside a complex graphical background, extracting it, and enhancing it so that it can be easily recognized by optical character recognition (OCR). The technique uses a deep neural network for feature extraction and for classifying each frame as containing text or not. An error handling and correction (EHC) technique is used to resolve classification errors. A multiple frame integration (MFI) algorithm is introduced to extract the graphical text from its background. Text enhancement is done by adjusting the contrast, reducing noise, and increasing the pixel resolution. A standalone component-off-the-shelf (COTS) software package is used to recognize the text characters and to evaluate the system's performance. The proposed solution generalizes to multilingual text. A new dataset of videos in different languages was collected for this purpose and serves as a benchmark. A new HMVGG16 convolutional neural network (CNN), used to classify frames as text-containing or non-text-containing, achieves an accuracy of 98%. The system's weighted average caption extraction accuracy is 96.15%. The average recognition accuracy for correctly detected characters (CDC), using the Abbyy SDK OCR engine, is 97.75%.
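The abstract does not say how the EHC step works. As an illustration only, one common way to clean up per-frame classifier output is a sliding-window majority vote, on the assumption that a subtitle stays on screen for many consecutive frames, so an isolated label flip is more likely a classification error than a real change. The function name `smooth_labels`, the window size, and the tie rule below are hypothetical choices for this sketch, not the paper's method.

```python
from collections import Counter
from typing import List


def smooth_labels(labels: List[int], window: int = 5) -> List[int]:
    """Majority-vote each frame label with its neighbours (hypothetical sketch).

    labels: per-frame classifier output, 1 = text-containing, 0 = non-text.
    window: odd number of frames in the centred voting window (assumed value).
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i in range(len(labels)):
        # Clip the window at the start and end of the sequence.
        neighbourhood = labels[max(0, i - half): i + half + 1]
        ranked = Counter(neighbourhood).most_common()
        # A tie can occur only in a clipped window; keep the original label.
        if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
            smoothed.append(labels[i])
        else:
            smoothed.append(ranked[0][0])
    return smoothed


# Example: two isolated flips (frames 3 and 9) are overruled by their neighbours.
raw = [0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0]
print(smooth_labels(raw, window=3))
# [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0]
```

A larger window removes longer runs of spurious labels but also risks erasing genuinely short captions, so the window size would need tuning against the frame rate and the shortest expected subtitle duration.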

Acknowledgements

I would like to thank God for his help. Special thanks to the RDI team, Dr. Sven Dickinson, and my faculty department members for supporting me with their experience and data set used in my research.

Author information

Corresponding author

Correspondence to Hossam Elshahaby.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Elshahaby, H., Rashwan, M. An end to end system for subtitle text extraction from movie videos. J Ambient Intell Human Comput 13, 1853–1865 (2022). https://doi.org/10.1007/s12652-021-02951-1

  • DOI: https://doi.org/10.1007/s12652-021-02951-1
