Elsevier

Future Generation Computer Systems

Volume 87, October 2018, Pages 328-340

A novel machine learning approach for scene text extraction

https://doi.org/10.1016/j.future.2018.04.074

Highlights

  • A novel method is proposed for scene text extraction, recognition and correction.

  • MSER technique is used for segmenting text/non-text areas after preprocessing.

  • A feature fusion approach is used for CC classification using SVM and weighted sum.

  • A CNN model is proposed for character labeling and Hamming distance for correction.

  • Conclusions and analysis are performed on datasets ICDAR2003, SVT and IIIT5k.

Abstract

Image-based text extraction is a popular and challenging research field in computer vision. In this paper, the demanding problem of natural scene text identification and extraction is investigated; it is difficult due to cluttered backgrounds, unstructured scenes, varying orientations, ambiguities and more. For text identification, the input image is converted to the LUV color space for contrast enhancement in order to obtain stable regions. The L-channel is then selected for region segmentation using the standard MSER technique. To differentiate between text and non-text regions, various geometrical properties are also considered in this work. Connected components are then classified to obtain a segmented image by fusing two feature descriptors, LBP and T-HOG. First, both feature descriptors are classified separately using linear SVMs; second, the results are combined by a weighted-sum fusion technique to classify regions as text or non-text. In text recognition, text regions are recognized and labeled with a novel CNN. The CNN output is stored in a text file to form a text word. Finally, the text file is matched against a lexicon to obtain the correct scene text word, applying the Hamming-distance (error correction) technique when necessary.

Introduction

Identification and extraction of scene text from natural images and videos has become an essential task in recent computer vision research due to the widespread use of smart gadgets. It is also in high demand for content-based image retrieval and understanding. Technically, text extraction involves two major steps: (a) text detection, in which text is identified and localized in natural scenes and/or videos; in short, a process for determining text/non-text regions; (b) text recognition, which means understanding the semantic meaning of the text. In general, text is of two types [1]: (1) scene text, which is normally captured by a camera and shows ordinary surroundings; this makes the scene unstructured and ambiguous due to uncertain situations, e.g., advertisement hoardings, sign boards, shops, text on buses, panels and many more; (2) caption or graphic text, which is added manually to images and/or videos in order to support visual and audio content, so its extraction is simpler than that of natural scene text.

The challenges posed by the diversity and complexity of natural images for text extraction come from three different angles: (1) variation in natural scene text, since unrestrained and uncontrolled surroundings produce widely diverse font sizes, styles, colors, scales and orientations; (2) background uncertainty, with elements like roads, signs, grass, buildings, bricks and pavements; all these factors make it difficult to separate actual text from the natural image and can become a source of confusion and errors; (3) interference factors like noise, low quality, distortion and inconsistent lighting, which also cause problems in natural scene text identification and extraction.

The most important text identification classes are texture, connected component and hybrid approaches. In the texture-oriented approach proposed in [2], text is closely related to a class of texture from which certain characteristics such as filter responses, wavelet coefficients and local intensities can be processed. However, these methods thoroughly scan all locations and scales and hence prove computationally costly. The connected component approach published in [3] performs edge detection first and then applies a bottom-up method to merge minor regions into bigger regions until all regions are detected. In addition, geometrical characteristics such as eccentricity, solidity, aspect ratio, Euler number and extent, plus some heuristics, are used to combine text regions in order to extract and localize the text. Hybrid approaches are a combination of texture-oriented and connected-component-oriented approaches.
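
To illustrate the connected-component style of filtering described above, the following stand-alone Python sketch labels 4-connected components in a binary mask and screens them with two of the geometric heuristics mentioned (aspect ratio and extent). The thresholds are illustrative placeholders, not values from the paper.

```python
from collections import deque

def connected_components(mask):
    """Label 4-connected foreground components in a binary grid (list of lists)."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    comps = []
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and not seen[r][c]:
                q, pixels = deque([(r, c)]), []
                seen[r][c] = True
                while q:                     # breadth-first flood fill
                    y, x = q.popleft()
                    pixels.append((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < rows and 0 <= nx < cols \
                                and mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            q.append((ny, nx))
                comps.append(pixels)
    return comps

def is_text_candidate(pixels, max_aspect=5.0, min_extent=0.2):
    """Heuristic filter on aspect ratio and extent (area / bounding-box area)."""
    ys = [p[0] for p in pixels]
    xs = [p[1] for p in pixels]
    h = max(ys) - min(ys) + 1
    w = max(xs) - min(xs) + 1
    aspect = max(w, h) / min(w, h)
    extent = len(pixels) / (w * h)
    return aspect <= max_aspect and extent >= min_extent
```

On a toy mask, a compact blob passes the filter while a long thin line (a typical non-text artifact such as a fence or railing edge) is rejected.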

The aim and uniqueness of the proposed work is that it not only recognizes the text intelligently but also corrects errors in order to preserve the actual meaning of the text. To the best of our knowledge from the reviewed literature, this is the first such effort in this scenario.

Recently, MSER [[4], [5]], SWT [6] and binarization techniques [7] have become popular for extracting text from natural scenes. In the proposed methodology, MSER is used to extract text after enhancing the contrast of the scene image by converting it to the LUV color space and selecting the gray-scale L-channel for finding stable regions as a preprocessing step. Second, a novel approach classifies MSER components into text and non-text by applying a feature descriptor that combines LBP and T-HOG with an SVM as the classifier. Third, a CNN architecture is proposed for character recognition and character labeling. The output of the CNN is stored in a text file to form a string of words. The task is complete if the output of the text file matches the scene text; otherwise, error correction is handled using the Hamming-distance technique. The complete framework is described thoroughly in the subsequent sections of this work.
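
The weighted-sum fusion step described above can be sketched as follows. The two scores stand in for the signed decision values of the two linear SVMs (one trained on LBP features, one on T-HOG); the weight and threshold used here are hypothetical, chosen only for illustration.

```python
def weighted_sum_fusion(lbp_score, thog_score, w=0.6, threshold=0.0):
    """Fuse two per-region classifier scores by a weighted sum.

    lbp_score, thog_score: signed decision values from the two linear SVMs
    (positive leans text, negative leans non-text). Returns the fused score
    and the final text/non-text decision.
    """
    fused = w * lbp_score + (1.0 - w) * thog_score
    return fused, fused > threshold

def classify_regions(score_pairs, w=0.6):
    """score_pairs: list of (lbp_score, thog_score), one per MSER component."""
    return [weighted_sum_fusion(a, b, w)[1] for a, b in score_pairs]
```

With w = 0.6 the LBP evidence dominates; a region the LBP-SVM rates +1.0 but the T-HOG-SVM rates -1.0 still fuses to a small positive score and is kept as text.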

This work has been evaluated on the standard datasets of the domain and performs well on all of them, primarily in terms of accuracy, with precision, recall and F-score also reported. Moreover, the application is programmed in C and Python on the Ubuntu platform; combining the two languages into a single system gave remarkable results. The first part, text detection, presented in Fig. 1, is programmed in C, while the second part, text recognition, is carried out in Python.


Related work

Researchers have proposed several methods for identifying and detecting text in scenes/videos over the last two decades. As mentioned earlier, these fall mainly into three classes: texture-oriented, connected-component-oriented and hybrid.

Texture-based methods [[6], [7], [8], [9]], as described earlier, are computationally costly because they rely on textural properties (local intensities, filter responses, wavelet coefficients) to identify text and non-text positions in a natural scene. In addition, these

Objectives and contribution

A novel scene text extraction system is presented in this article, able to identify and recognize scene text intelligently and efficiently. Many issues can arise in scene text detection, including font size, font color, font style, orientation, blur, occlusion, opacity and noise. Because of all these issues, it is sometimes difficult to train a system to make intelligent decisions. In this work, CNN based character recognition and labeling is proposed to recognize and label

Natural scene text recognition model

A complete process diagram of the suggested framework is presented in Fig. 1, while Fig. 2 presents the pseudo code of the methodology. The proposed work has two major parts: (1) text detection, comprising contrast enhancement to detect stable MSER regions, after which connected component classification is performed using the T-HOG and LBP feature descriptors for character grouping; (2) text recognition, executed after word splitting, in which the major activity is CNN based character

Word recognition and error correction

The output of the proposed CNN is one character label at a time, because character image patches are fed serially to the network. Each character label is then collected into a text file to recognize the word. In short, the actual recognized word is in the text file, which is extracted and recognized after applying the proposed techniques of Sections 4.1 to 4.4 to natural scene images. Sometimes the CNN is unable to recognize a character properly, which in turn is labeled
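
The lexicon search with Hamming-distance error correction described here can be sketched as below. The lexicon contents and the policy of skipping lexicon words of a different length are illustrative assumptions, since Hamming distance is defined only for equal-length strings.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def correct_word(recognized, lexicon):
    """Return the recognized word if it is already in the lexicon; otherwise
    return the same-length lexicon word with the smallest Hamming distance.
    If no same-length entry exists, the CNN output is kept unchanged."""
    if recognized in lexicon:
        return recognized
    candidates = [w for w in lexicon if len(w) == len(recognized)]
    if not candidates:
        return recognized
    return min(candidates, key=lambda w: hamming(recognized, w))
```

For example, if the CNN misreads "SALE" as "5ALE", the single-substitution neighbor "SALE" is the closest lexicon entry and is returned as the corrected word.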

Results and experiments

To obtain the conclusions and analysis, the experiments are performed on an Intel 6th-generation Core i5 3.4 GHz CPU and an Nvidia GeForce GTX 1070 GPU with 8 GB of memory and compute capability 6.1. Training on positive and negative samples is performed with mini-batches of size 128. The total number of epochs is 30, although the network becomes stable after about 25 epochs in most training runs. The learning rate and momentum are set to 0.001 and 0.9, respectively.
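
The reported training configuration can be made concrete with a minimal sketch: a mini-batch splitter for the stated batch size of 128, and one SGD-with-momentum update using the stated learning rate (0.001) and momentum (0.9). This is an illustrative stand-alone implementation, not the authors' training code.

```python
def minibatches(samples, batch_size=128):
    """Split the training samples into mini-batches of the stated size;
    the final batch may be smaller."""
    for i in range(0, len(samples), batch_size):
        yield samples[i:i + batch_size]

def sgd_momentum_step(weights, grads, velocity, lr=0.001, momentum=0.9):
    """One SGD-with-momentum update on plain lists of floats:
    v <- momentum * v - lr * g;  w <- w + v."""
    for i in range(len(weights)):
        velocity[i] = momentum * velocity[i] - lr * grads[i]
        weights[i] += velocity[i]
    return weights, velocity
```

With 300 samples, the splitter yields batches of 128, 128 and 44; with a gradient of 10.0 and zero initial velocity, one step moves a weight of 1.0 to 0.99.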

Conclusion

In this work, a novel machine learning approach to scene text extraction is suggested. Leveraging the strength of MSER, stable regions are detected on the enhanced image using the gray-scale L-channel. To build a robust feature descriptor, the LBP and T-HOG feature sets are combined into a single feature vector by a weighted sum, so that text and non-text regions are identified accurately with a linear SVM. This helps detect characters through character grouping of the scene text. Further, the output of text detection

Limitations and future work

The proposed methodology has ample room for improvement. First, it would be better to perform character-level detection and recognition in a single feed-forward integrated pipeline rather than the stepwise two-part approach of text detection followed by recognition. Second, text instances in scenes are not properly separated from each other, especially those that suffer from various kinds of variation in font size and style, viewpoint and blur. These aspects need to be addressed more


References (47)

  • C. Yi, et al., Text string detection from natural scenes by structure-based partition and grouping, IEEE Trans. Image Process. (2011)
  • K.I. Kim, et al., Texture-based approach for text detection in images using support vector machines and continuously adaptive mean shift algorithm, IEEE Trans. Pattern Anal. Mach. Intell. (2003)
  • B. Leibe, et al., Scale-invariant object categorization using a scale-adaptive mean-shift search
  • Y. Zhong, et al., Automatic caption localization in compressed video, IEEE Trans. Pattern Anal. Mach. Intell. (2000)
  • X. Chen, et al., Detecting and reading text in natural scenes
  • P. Viola, M. Jones, Fast and robust classification using asymmetric adaboost and a detector cascade, in: Advances in...
  • W. Huang, et al., Text localization in natural images using stroke feature transform and text covariance descriptors,...
  • P. Shivakumara, et al., A Laplacian approach to multi-oriented text detection in video, IEEE Trans. Pattern Anal. Mach. Intell. (2011)
  • A. Gupta, A. Vedaldi, A. Zisserman, Synthetic data for text localisation in natural images, in: Proceedings of the IEEE...
  • T. He, Text-attentional convolutional neural network for scene text detection, IEEE Trans. Image Process. (2016)
  • Z. Zhang, et al., Multi-oriented text detection with fully convolutional networks, in: Proceedings of the IEEE...
  • Y. Liu, L. Jin, Deep matching prior network: Toward tighter multi-oriented text detection, 2017. arXiv preprint...
  • Y. Liu, et al., A contour-based robust algorithm for text detection in color images, IEICE Trans. Inf. Syst. (2006)

    Ghulam Jillani Ansari obtained his Bachelor's degree in Computer Science from BZU Multan, Pakistan in 2000 and his MS (CS) from the University of Agriculture Faisalabad, Pakistan in 2004. He is currently a Ph.D. (CS) student at COMSATS Wah Cantt, Pakistan and has been employed as an Assistant Professor at the University of Education Lahore (Multan Campus), Pakistan since 2006. His research interests are Object Oriented Programming, Image Processing, Computer Vision, Neural Networks and Machine Learning. Presently his research area in the Ph.D. (CS) is Computer Vision and Graphics.

    Jamal Hussain Shah, Ph.D. is an Assistant Professor at COMSATS, Wah Cantt, Pakistan. He completed his Ph.D. in Pattern Recognition at the University of Science and Technology of China, Hefei, P.R. China, and his Masters in Computer Science at COMSATS Wah, Pakistan. His area of specialization is Automation and Pattern Recognition. He has been in the education field since 2008 and has 21 publications in IF, SCI and ISI journals as well as national and international conferences. He is currently supervising 4 Ph.D. (CS) students and 6 Masters students. He received the COMSATS research productivity award from 2013 to 2016. His research interests include Deep Learning, Algorithm Design and Analysis, Machine Learning, Image Processing and Big Data.

    Mussarat Yasmin, Ph.D. is an Assistant Professor at COMSATS, Wah Cantt, Pakistan. Her area of specialization is Image Processing. She has been in the education field since 1993 and has 45 research publications so far in IF, SCI and ISI journals as well as national and international conferences. A number of undergraduate projects have been completed under her supervision, and she is currently supervising five Ph.D. (CS) students. She is a gold medalist in MS (CS) from IQRA University, Pakistan, and has received the COMSATS research productivity award since 2012. Her research interests include Neural Networks, Algorithm Design and Analysis, Machine Learning and Image Processing.

    Muhammad Sharif, Ph.D. is an Associate Professor at COMSATS, Wah Cantt, Pakistan. His area of specialization is Artificial Intelligence and Image Processing. He has been teaching since 1995 and has more than 110 research publications in IF, SCI and ISI journals and national and international conferences. He has so far supervised 25 MS (CS) theses and is currently supervising 5 Ph.D. (CS) students as well as co-supervising 5 others. More than 200 undergraduate students have successfully completed their project work under his supervision. His research interests are Image Processing, Computer Networks & Security and Algorithm Design and Analysis.

    Steven Lawrence Fernandes, Ph.D. is a member of the Core Research Group, Karnataka Government Research Centre of Sahyadri College of Engineering and Management, Mangalore, Karnataka. He received the Young Scientist Award from the Vision Group on Science and Technology, Government of Karnataka, and a research grant from The Institution of Engineers (India), Kolkata. His Ph.D. work, "Match Composite Sketch with Drone Images", has received a patent notification (Patent Application Number: 2983/CHE/2015) from the Government of India.
