Abstract
Recognition of text appearing in videos offers a number of interesting applications including retrieval systems, generation of user alerts on keywords and news summarization systems. Thanks to recent advancements in deep learning, high text recognition rates have been reported in recent years. An important step in training such systems is the pre-processing of images for effective feature learning and classification. This study investigates the impact of pre-processing on recognition of cursive video text using Urdu as a case study. The recognition engine relies on a combination of convolutional and long short-term memory networks followed by a connectionist temporal classification layer for sequence alignment. The system is fed with gray scale text line images directly as well as by segmenting the text from background using various thresholding techniques. An experimental study on a dataset of 12,000 text lines in cursive Urdu reveals that appropriately pre-processing the text line images significantly improves the recognition rates.
Keywords
- Cursive video text
- Binarization
- Convolutional neural networks (CNNs)
- Long short-term memory networks (LSTMs)
1 Introduction
Semantic video retrieval has gained notable research attention in recent years due to the rapid increase in digital multimedia data. Typically, images and videos are annotated with different tags or keywords (Fig. 1) which are exploited by search engines for retrieval purposes. A limited set of descriptive tags, however, does not suffice to capture the rich information in images and videos. These limitations served as a catalyst to drive research in content-based image and video retrieval systems. More specifically, in the context of video retrieval, smart retrieval systems focus on one or more of the following components.
- Visual content in the video, such as persons, objects and buildings.
- Audio content, such as spoken (key)words.
- Textual content, which includes news tickers and subtitles.
Among these, we focus on the textual content in videos in the present study. Text appearing in videos can be localized and recognized to support keyword-based indexing and retrieval applications. Furthermore, such systems can also be adapted to generate alerts on the occurrence of specific keywords (breaking news, for example). The development of such systems includes two major components, detection and localization of textual content and recognition of text (Video Optical Character Recognition - VOCR), the latter being the subject of our research.
Thanks to the recent advancements in deep learning techniques, error rates for tasks like object detection and recognition have dropped significantly. Many of the state-of-the-art object detectors have been adapted to localization of textual content encompassing both caption [27] and natural scene text [32]. The development of large-scale databases with labeled textual content has also been a significant milestone [11]. Likewise, recognition of text has witnessed a renewed interest of the research community and the latest deep learning based techniques have been employed to develop robust recognition systems [5, 8, 31]. Despite these developments, recognition of cursive text remains a challenging problem. Characters in cursive scripts join to form partial words (ligatures) and character shapes vary depending upon their position within a partial word. Typical examples of cursive scripts include Arabic, Urdu, Persian and Pashto. The complexity of the problem is further increased if the text appears in video frames (rather than scanned document images). Typical challenges in recognition of video text include low resolution of text, non-homogeneous or complex backgrounds and false joining of adjacent characters into single components.
An important step in recognition of video text is the pre-processing of images. There has been a debate on whether to feed colored, grayscale or binarized images to the learning algorithm. This paper presents an analytical study to investigate the impact of pre-processing on recognition of cursive video text. More specifically, we employ a combination of convolutional neural networks (CNN) and bidirectional long short-term memory (LSTM) networks for recognition of cursive text. The hybrid CNN-LSTM combination is fed directly with gray scale text line images as well as by segmenting the text from background. For segmentation of text and background, global as well as a number of local thresholding techniques are studied. The study reveals that pre-processing the images appropriately prior to training the models results in enhanced recognition rates. We employ video text lines in cursive Urdu as a case study but the findings can be generalized to other cursive scripts as well.
This paper is organized as follows. The next section presents an overview of video text recognition in cursive scripts followed by an introduction to our dataset in Sect. 3. Section 4 introduces the pre-processing and the recognition techniques employed while Sect. 5 details the experimental setup, the reported results and the accompanying discussion. Finally, we provide our concluding remarks in Sect. 6 of the paper.
2 Related Work
Recognition of text, commonly known as Optical Character Recognition (OCR), has remained one of the most investigated pattern classification problems. Thanks to the research endeavors spanned over decades, state-of-the-art recognition systems have been developed for printed documents [12], handwriting [19], natural text in scene images [21] and artificial (caption) text appearing in videos [3].
As discussed earlier, cursive text offers a more challenging recognition problem due to the complexity of the script. Recognition techniques for cursive scripts are generally divided into holistic (segmentation-free) and analytical (segmentation-based) methods. Holistic techniques employ partial words as recognition units while analytical techniques aim to recognize individual characters which are either segmented explicitly or implicitly [8]. Implicit segmentation refers to feeding the text lines and ground truth transcriptions to the learning algorithm to itself learn the character shapes and segmentation points [14, 15]. Such techniques have remained a popular choice of researchers as explicit segmentation of cursive text into characters is highly challenging.
Among notable studies on recognition of cursive video text, Zayene et al. [31] employ long short-term memory networks (LSTMs) to recognize Arabic text in video frames. The technique was evaluated on two datasets, ALIF [28] and ACTiV [30], and high recognition rates were reported. Halima et al. [7] present a system to localize and recognize Arabic text in video frames. The recognition engine exploits a set of statistical features with a nearest neighbor classifier to recognize the partial words. In another comprehensive study [9], an end-to-end system is presented for recognition of Arabic text in videos and natural scenes. The system relies on a combination of CNN and RNN for recognition of text. Likewise, Yousfi et al. [28] employ CNNs with deep auto-encoders to compute features at multiple scales. The feature sequences are fed to a recurrent neural network for prediction of the transcription. The technique is evaluated on a collection of videos from a number of Arabic TV channels and reports promising recognition rates. In another interesting work [29], the authors focus on improving the performance of LSTM based recognition engines by employing recurrent connectionist language models. An improvement of 16% in word recognition rates with respect to baseline methods is demonstrated by the introduction of the proposed models.
With reference to Urdu text, a number of robust techniques have been presented for recognition of printed document images. The holistic recognition techniques reported in the literature mostly employ hidden Markov models to recognize partial words (ligatures) [2]. A major issue in holistic techniques is the large number (in the thousands) of ligature classes to be recognized. An effective technique is to separate the main body of ligatures from dots and diacritics to reduce the number of classes, as many partial words share the same main body and differ only in the number or positioning of dots and diacritics. After recognition, dots are re-associated with their parent main body component as a post-processing step [5].
Similar to Arabic text recognition, implicit segmentation based techniques have been widely employed for recognition of Urdu text as well. These techniques typically employ LSTMs with a Connectionist Temporal Classification (CTC) layer. The network is fed either with raw pixels [16] or with feature sequences extracted by a CNN [15]. The literature is relatively limited when it comes to recognition systems for Urdu text appearing in videos. A recent work by Tayyab et al. [25] employs bidirectional LSTMs to recognize news tickers in various Urdu news channels. In another related work, Hayat et al. [8] present a holistic technique to recognize Urdu ligatures from video text. Ligature images are first grouped into clusters and convolutional neural networks are trained to recognize the ligature classes. Though the system reports very high recognition rates, the number of ligature classes considered in the study is fairly small (only a few hundred).
A critical review of the work on recognition of cursive scripts reveals that implicit segmentation based techniques have proven to be more effective than holistic recognition techniques. Such techniques not only avoid extraction of partial words from lines of text, they also do not require the training data to be organized into clusters of partial words. The text lines and corresponding ground truth are directly fed to the learning algorithm, making the technique simple yet effective. In our study, we have also chosen to employ an implicit segmentation based recognition technique. Prior to presenting its details, we first introduce the dataset considered in our work in the next section.
3 Dataset
Benchmark datasets have been developed (and labeled) for recognition of printed Urdu text. Two well-known such datasets are CLE [1] and UPTI [20], which have been widely employed for evaluation of recognition systems targeting printed Urdu text. From the viewpoint of video text, datasets like ALIF [28] and ACTiV [30] have been developed for cursive Arabic text. A small dataset of video frames containing artificial text in Urdu is presented in [24]. The number of text lines in that dataset, however, is fairly limited and cannot be employed to train deep neural networks. As a part of our endeavors towards the development of a comprehensive video retrieval engine, we are in the process of developing and labeling a large database of video frames with occurrences of artificial text. We have collected more than 400 hours of video from various news channels and presently, more than 10,000 video frames have been labeled. The ground truth information associated with each frame includes some metadata, the location of each text line (bounding box), script information and the actual transcription of the text. The dataset will be made publicly available once the labeling process is complete. More details on the dataset and the labeling process can be found in our previous work [13].
For the present study, we consider 12,000 text lines which are extracted from the video frames using ground truth information. Since the focus of this study is on recognition and not on localization, the localization information in the ground truth file is used to extract the text lines. For all experiments, 10,000 text lines are used in the training set and 2,000 in the test set. Sample text regions extracted from the frames are illustrated in Fig. 2.
4 Methods
This section presents the details of the pre-processing and recognition techniques employed in our work. As discussed earlier, the key objective of this study is to investigate the impact of pre-processing on the recognition performance. The recognition engine is fed with gray-scale images as well as by extracting the text using various thresholding techniques. For recognition, a hybrid model of convolutional and long short-term memory networks is employed. Details on pre-processing and recognition are presented in the following.
4.1 Pre-Processing
Presenting the recognition engine with appropriate data is an important step that directly affects the recognition performance. Since the key task of the learning algorithm is to learn character shapes and boundaries, color information is generally discarded as it may falsely lead the algorithm to learn colors rather than shapes. The idea is further strengthened by the fact that text is readable by humans without color information. Consequently, as a first step, all images are converted to gray scale. A major issue affecting the recognition performance is the non-homogeneous background of text, as can be seen from Fig. 2. Furthermore, the polarity of text can be bright text on a dark background or dark text on a bright background. The learning algorithm must be provided with images having a consistent text polarity. It is also known that binarizing the text lines can be useful, but in some cases imperfect binarization can lead to deterioration in the recognition performance. These and similar issues served as the motivation of our investigations in the current study.
We start with identifying the polarity of the text. For all experiments, we assume the convention of dark text on a bright background. To detect the polarity of a given text line, we apply a Canny edge detector on the gray scale image to detect blobs in the image. These blobs correspond to (approximate) text regions in the image. Region filling is applied to these blobs and the resulting image is used as a mask to extract the corresponding regions from the original gray scale image. We then compute the median gray value (\(Med_{text}\)) of the extracted blobs as well as the median gray value of the background (all pixels which do not belong to any blob), \(Med_{back}\). If \(Med_{text} < Med_{back}\), we have dark text on a bright background and the polarity agrees with our assumed convention. On the other hand, if \(Med_{text} > Med_{back}\), the line contains bright text on a dark background. In such cases, the polarity of the image is reversed prior to any further processing. The process is summarized in Fig. 3.
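The polarity check can be sketched as follows. This is an illustrative approximation only: a simple gradient-based edge map and row-wise filling stand in for the Canny detector and region filling described above, and the edge threshold value is an assumption, not a tuned parameter.

```python
import numpy as np

def normalize_polarity(gray, edge_thresh=40):
    """Ensure dark text on a bright background (the convention assumed here).

    gray: 2-D uint8 text line image. Returns the image, inverted if it
    contained bright text on a dark background.
    """
    g = gray.astype(np.int32)
    # Crude edge map: strong horizontal/vertical intensity changes
    # approximate the Canny edge detection step.
    gx = np.abs(np.diff(g, axis=1, prepend=g[:, :1]))
    gy = np.abs(np.diff(g, axis=0, prepend=g[:1, :]))
    edges = (gx + gy) > edge_thresh
    # Row-wise filling between the outermost edge pixels approximates
    # the region-filling step that produces the text-blob mask.
    mask = np.zeros_like(edges)
    for r in range(edges.shape[0]):
        cols = np.flatnonzero(edges[r])
        if cols.size >= 2:
            mask[r, cols[0]:cols[-1] + 1] = True
    if not mask.any() or mask.all():
        return gray  # no usable blobs; leave the line unchanged
    med_text = np.median(gray[mask])   # median gray level of text blobs
    med_back = np.median(gray[~mask])  # median gray level of the background
    # Bright text on dark background: invert to match the convention.
    return 255 - gray if med_text > med_back else gray
```

After this step, every line presented to the binarization and recognition stages follows the same dark-on-bright convention.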
Once all text lines contain text in the same polarity, they can either be directly fed to the recognition module or first binarized to extract only the textual information. For binarization, we investigated a number of thresholding techniques. These include Otsu's global thresholding method [18] as well as a number of local thresholding algorithms. The local thresholding algorithms are adaptive techniques where the threshold value of each pixel is computed as a function of the neighboring pixels. Most of these algorithms are inspired by the classical Niblack thresholding [17], where the threshold is computed as a function of the mean and standard deviation of the gray values in the neighborhood of a reference pixel. Other algorithms investigated in our study include Sauvola's [22], Feng's [6] and Wolf's thresholding algorithms [26]. Prior to binarizing the images, we also apply a smoothing (median) filter on each text line to remove or suppress any noisy patterns in the image. Binarization results of applying various thresholding techniques to a sample text line image are illustrated in Fig. 4. From a subjective analysis of these results, Wolf's algorithm, which was specifically proposed for low resolution video text, seems to outperform the other techniques. Nevertheless, it is hard to generalize from visual inspection of a few sample images, and the recognition rates on images generated by each of these techniques are a better indicator of the effectiveness of a method.
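As an illustration of the family of local methods, the following sketch computes Niblack- and Sauvola-style thresholds over a sliding window. The window size and the parameters k and R are illustrative defaults, not the tuned values used in the experiments, and the helper name is ours.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def local_thresholds(gray, window=15, k_niblack=-0.2, k_sauvola=0.5, R=128.0):
    """Binarize with Niblack (T = m + k*s) and Sauvola
    (T = m * (1 + k * (s/R - 1))) local thresholds.

    gray: 2-D uint8 image; window must be odd. Returns two boolean
    foreground masks (True = text pixel, assuming dark text on a bright
    background after polarity normalization).
    """
    w = window // 2
    # Pad with edge values so every pixel has a full neighborhood.
    p = np.pad(gray.astype(np.float64), w, mode='edge')
    win = sliding_window_view(p, (window, window))  # H x W x window x window
    m = win.mean(axis=(2, 3))   # local mean of gray values
    s = win.std(axis=(2, 3))    # local standard deviation
    t_niblack = m + k_niblack * s
    t_sauvola = m * (1.0 + k_sauvola * (s / R - 1.0))
    # Dark text is foreground where the pixel falls below the threshold.
    return gray < t_niblack, gray < t_sauvola
```

Otsu's method, by contrast, picks a single global threshold from the gray-level histogram, which is why it degrades on the non-homogeneous backgrounds typical of video frames.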
4.2 Text Recognition
As discussed earlier, we employ an implicit segmentation based recognition technique that does not require segmenting partial words into characters explicitly. More specifically, we employ convolutional neural networks to extract feature sequences from text line images. These sequences, along with the ground truth transcription, are fed to a bidirectional long short-term memory network. This hybrid architecture is often referred to as C-RNN in the literature [4, 10] and has shown promising results on recognition problems [23]. A CTC layer is also employed for alignment of the ground truth transcription with the corresponding feature sequences. Figure 5 presents an overview of the recognition engine illustrating the C-RNN with a CTC layer.
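The CTC layer's role at inference time can be illustrated with standard greedy best-path decoding: take the argmax label at each frame of the feature sequence, collapse consecutive repeats, then drop the blank symbol. This sketch assumes label index 0 is the blank; the paper does not specify its decoding scheme, so this is only the common baseline.

```python
def ctc_greedy_decode(frame_labels, blank=0):
    """Collapse a best-path frame label sequence into an output sequence.

    frame_labels: per-frame argmax class indices from the network's
    softmax outputs. CTC decoding collapses runs of the same label and
    then removes blanks; a blank between two identical labels is what
    keeps a doubled character distinct.
    """
    out, prev = [], None
    for lbl in frame_labels:
        if lbl != prev and lbl != blank:
            out.append(lbl)
        prev = lbl
    return out

# The blank between the two 1s preserves the doubled character:
# ctc_greedy_decode([0, 1, 1, 0, 1, 2, 2, 0]) -> [1, 1, 2]
```

This is why the network can emit per-frame predictions without any explicit character segmentation: the alignment is recovered by the collapse rule.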
5 Experiments and Results
This section presents the details of the recognition results using various binarization techniques. As mentioned earlier, for all experiments, 10,000 text line images are used in the training set and 2,000 in the test set. The recognition engine outputs the predicted transcription of a query text line. To quantify the recognition performance, we calculate the Levenshtein distance between the predicted and the ground truth transcriptions to compute the character recognition rates.
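A minimal sketch of this metric follows. The exact normalization used in the experiments is not specified; a common convention, assumed here, divides the edit distance by the ground-truth length.

```python
def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) via
    dynamic programming over two rows of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def char_recognition_rate(predicted, ground_truth):
    """CRR = 1 - edit_distance / ground-truth length, floored at zero."""
    if not ground_truth:
        return 1.0 if not predicted else 0.0
    err = levenshtein(predicted, ground_truth) / len(ground_truth)
    return max(0.0, 1.0 - err)
```

For example, a prediction with one substituted character in a four-character line yields a rate of 0.75 under this convention.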
For a meaningful comparison of various binarization techniques, we employ the same network architecture and the same set of hyper-parameters for each of the experiments. In the first experiment, we compute the recognition rates directly on the gray scale images of text lines and achieve a character recognition rate of 83.48% on the 2,000 test lines. Subsequently, we evaluated the recognition engine with lines binarized through the set of binarization techniques discussed earlier. The recognition rates realized with various binarization techniques are summarized in Table 1.
A number of interesting observations can be made from the reported recognition rates. The gray scale text line images report higher recognition rates when compared to those obtained on text lines binarized using Niblack's and Otsu's thresholding algorithms. This observation is consistent with our initial assessment of the binarization algorithms where, in general, Niblack's binarization introduces a lot of noise in the binarized images while global thresholding fails once the text images have non-homogeneous backgrounds. The performance of Feng's and Sauvola's binarization methods is comparable, at 86.41% and 85.75% respectively. Text lines binarized using Wolf's algorithm report the highest recognition rate of 93.48%. This observation is also consistent with the subjective analysis of the binarization techniques, where Wolf's algorithm produced relatively cleaner binarized images. Furthermore, the algorithm was specifically designed for binarization of video text (mostly in French) and the current results validate its superiority over the other techniques for recognition of cursive text as well.
To provide an insight into recognition errors, we illustrate (in Fig. 6) the predicted transcriptions of a sample text line binarized using the various thresholding techniques. Although all algorithms produce recognition errors, it is interesting to note that due to noisy binarization in the case of Niblack's and Otsu's thresholding, the predicted and the actual characters are very different. On the other hand, the morphological similarity between the predicted and the actual characters seems to be high in the case of the other techniques, Wolf's algorithm for instance.
We also carried out a series of experiments to study the impact of the size of the training data on the recognition performance. Keeping the test set fixed at 2,000 lines of text, we varied the number of training text lines from 2,000 to 10,000. The corresponding recognition rates are illustrated in Fig. 7, where it can be seen that the recognition rate begins to stabilize at around 7,000 lines of text, which is a manageable size for such applications.
It can be concluded from the realized recognition rates that pre-processing is a critical step in recognition systems that has a significant impact on the recognition performance. Enhancing this step can lead to improved recognition rates. Binarizing the images appropriately resulted in an increase of 10 percentage points (from 83.48% to 93.48%) in the recognition rate with respect to what is reported on the gray scale images. While deep learning based recognition systems represent state-of-the-art solutions, it is important to feed these systems with appropriately pre-processed data to achieve performance that is on par with the expectations of commercial applications.
6 Conclusion
This paper investigated the impact of pre-processing on recognition of cursive video text. We employed Urdu text appearing in video frames as a case study but the findings can be generalized to other cursive scripts as well. The recognition engine comprises a combination of convolutional and long short-term memory networks followed by a connectionist temporal classification layer. The network is trained using text line images in gray scale as well as by segmenting text from background using various binarization techniques. Experiments on a dataset of 12,000 text line images revealed that appropriate pre-processing of text lines significantly enhances the recognition performance.
In our further work on this problem, we intend to continue the labeling process to develop and make publicly available a large dataset of 30,000 labeled video frames. The present study targeted the recognition part only, which is planned to be integrated with the text localization module. This in turn will allow the development of a comprehensive textual content based retrieval system. From the viewpoint of recognition, similar to pre-processing, we also aim to investigate the impact of data augmentation on the recognition performance.
References
Center for Language Engineering. http://www.cle.org.pk. Accessed 15 Apr 2019
Ahmad, I., Mahmoud, S.A., Fink, G.A.: Open-vocabulary recognition of machine-printed Arabic text using hidden Markov models. Pattern Recogn. 51, 97–111 (2016)
Bhunia, A.K., Kumar, G., Roy, P.P., Balasubramanian, R., Pal, U.: Text recognition in scene image and video frame using color channel selection. Multimedia Tools Appl. 77(7), 8551–8578 (2018)
Choi, K., Fazekas, G., Sandler, M., Cho, K.: Convolutional recurrent neural networks for music classification. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2392–2396. IEEE (2017)
Din, I.U., Siddiqi, I., Khalid, S., Azam, T.: Segmentation-free optical character recognition for printed Urdu text. EURASIP J. Image Video Process. 2017(1), 62 (2017)
Feng, M.L., Tan, Y.P.: Contrast adaptive binarization of low quality document images. IEICE Electron. Express 1(16), 501–506 (2004)
Halima, M.B., Karray, H., Alimi, A.M.: Arabic text recognition in video sequences. arXiv preprint arXiv:1308.3243 (2013)
Hayat, U., Aatif, M., Zeeshan, O., Siddiqi, I.: Ligature recognition in Urdu caption text using deep convolutional neural networks. In: 2018 14th International Conference on Emerging Technologies (ICET), pp. 1–6. IEEE (2018)
Jain, M., Mathew, M., Jawahar, C.: Unconstrained scene text and video text recognition for Arabic script. In: 2017 1st International Workshop on Arabic Script Analysis and Recognition (ASAR), pp. 26–30. IEEE (2017)
Liang, M., Hu, X.: Recurrent convolutional neural network for object recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3367–3375 (2015)
Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
Märgner, V., Pal, U., Antonacopoulos, A., et al.: Document analysis and text recognition (2018)
Mirza, A., Fayyaz, M., Seher, Z., Siddiqi, I.: Urdu caption text detection using textural features. In: Proceedings of the 2nd Mediterranean Conference on Pattern Recognition and Artificial Intelligence, pp. 70–75. ACM (2018)
Naz, S., Umar, A.I., Ahmad, R., Ahmed, S.B., Shirazi, S.H., Razzak, M.I.: Urdu Nasta’liq text recognition system based on multi-dimensional recurrent neural network and statistical features. Neural Comput. Appl. 28(2), 219–231 (2017)
Naz, S., et al.: Urdu Nastaliq recognition using convolutional-recursive deep learning. Neurocomputing 243, 80–87 (2017)
Naz, S., Umar, A.I., Ahmed, R., Razzak, M.I., Rashid, S.F., Shafait, F.: Urdu Nasta’liq text recognition using implicit segmentation based on multi-dimensional long short term memory neural networks. SpringerPlus 5(1), 2010 (2016)
Niblack, W., et al.: An Introduction to Digital Image Processing, vol. 34. Prentice-Hall, Englewood Cliffs (1986)
Otsu, N.: A threshold selection method from gray-level histograms. IEEE Trans. Syst. Man Cybern. 9(1), 62–66 (1979)
Plamondon, R., Srihari, S.N.: Online and off-line handwriting recognition: a comprehensive survey. IEEE Trans. Pattern Anal. Mach. Intell. 22(1), 63–84 (2000)
Sabbour, N., Shafait, F.: A segmentation-free approach to Arabic and Urdu OCR. In: IS&T/SPIE Electronic Imaging, p. 86580N. International Society for Optics and Photonics (2013)
Saranya, K.C., Singhal, V.: Real-time prototype of driver assistance system for indian road signs. In: Reddy, M.S., Viswanath, K., K.M., S.P. (eds.) International Proceedings on Advances in Soft Computing, Intelligent Systems and Applications. AISC, vol. 628, pp. 147–155. Springer, Singapore (2018). https://doi.org/10.1007/978-981-10-5272-9_14
Sauvola, J., Pietikäinen, M.: Adaptive document image binarization. Pattern Recogn. 33(2), 225–236 (2000)
Shi, B., Bai, X., Yao, C.: An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Trans. Pattern Anal. Mach. Intell. 39(11), 2298–2304 (2017)
Siddiqi, I., Raza, A.: A database of artificial Urdu text in video images with semi-automatic text line labeling scheme. In: MMEDIA 2012, The Fourth International Conferences on Advances in Multimedia, pp. 75–81 (2012)
Tayyab, B.U., Naeem, M.F., Ul-Hasan, A., Shafait, F., et al.: A multi-faceted OCR framework for artificial Urdu news ticker text recognition. In: 2018 13th IAPR International Workshop on Document Analysis Systems (DAS), pp. 211–216. IEEE (2018)
Wolf, C., Jolion, J.M.: Extraction and recognition of artificial text in multimedia documents. Formal Pattern Anal. Appl. 6(4), 309–326 (2004)
Yan, X., et al.: End-to-end subtitle detection and recognition for videos in East Asian Languages via CNN ensemble. Signal Process. Image Commun. 60, 131–143 (2018)
Yousfi, S., Berrani, S.A., Garcia, C.: Deep learning and recurrent connectionist-based approaches for Arabic text recognition in videos. In: 2015 13th International Conference on Document Analysis and Recognition (ICDAR), pp. 1026–1030. IEEE (2015)
Yousfi, S., Berrani, S.A., Garcia, C.: Contribution of recurrent connectionist language models in improving LSTM-based Arabic text recognition in videos. Pattern Recogn. 64, 245–254 (2017)
Zayene, O., Hennebert, J., Touj, S.M., Ingold, R., Amara, N.E.B.: A dataset for Arabic text detection, tracking and recognition in news videos - AcTiV. In: 2015 13th International Conference on Document Analysis and Recognition (ICDAR) (2015)
Zayene, O., Touj, S.M., Hennebert, J., Ingold, R., Amara, N.E.B.: Multi-dimensional long short-term memory networks for artificial Arabic text recognition in news video. IET Comput. Vision 12(5), 710–719 (2018)
Zhou, X., et al.: EAST: an efficient and accurate scene text detector. In: Proceedings of the CVPR, pp. 2642–2651 (2017)
Acknowledgment
This study is supported by IGNITE, National Technology Fund, Pakistan under grant number ICTRDF/TR&D/2014/35.
© 2019 Springer Nature Switzerland AG
Mirza, A., Siddiqi, I., Mustufa, S.G., Hussain, M. (2019). Impact of Pre-Processing on Recognition of Cursive Video Text. In: Morales, A., Fierrez, J., Sánchez, J., Ribeiro, B. (eds) Pattern Recognition and Image Analysis. IbPRIA 2019. Lecture Notes in Computer Science(), vol 11867. Springer, Cham. https://doi.org/10.1007/978-3-030-31332-6_49