
1 Introduction

Semantic video retrieval has gained notable research attention in recent years due to the rapid increase in digital multimedia data. Typically, images and videos are annotated with tags or keywords (Fig. 1) which search engines exploit for retrieval purposes. A limited set of descriptive tags, however, does not suffice to capture the rich information in images and videos. These limitations served as a catalyst for research in content-based image and video retrieval systems. More specifically, in the context of video retrieval, smart retrieval systems focus on one or more of the following components.

  • Visual content in the video, such as persons, objects and buildings.

  • Audio content, such as spoken (key)words.

  • Textual content, which includes news tickers and subtitles.

Fig. 1. Tags associated with a video are used for retrieval purposes

Among these, we focus on the textual content in videos in the present study. Text appearing in videos can be localized and recognized to support keyword-based indexing and retrieval applications. Furthermore, such systems can also be adapted to generate alerts on the occurrence of specific keywords (breaking news, for example). The development of such systems involves two major components: detection and localization of textual content, and recognition of text (Video Optical Character Recognition, VOCR), the latter being the subject of our research.

Thanks to recent advancements in deep learning techniques, error rates for tasks like object detection and recognition have dropped significantly. Many state-of-the-art object detectors have been adapted to the localization of textual content, encompassing both caption [27] and natural scene text [32]. The development of large-scale databases with labeled textual content has also been a significant milestone [11]. Likewise, recognition of text has witnessed renewed interest from the research community, and the latest deep learning based techniques have been employed to develop robust recognition systems [5, 8, 31]. Despite these developments, recognition of cursive text remains a challenging problem. Characters in cursive scripts join to form partial words (ligatures), and character shapes vary depending upon their position within a partial word. Typical examples of cursive scripts include Arabic, Urdu, Persian and Pashto. The problem is further complicated if the text appears in video frames (rather than scanned document images). Typical challenges in recognition of video text include low resolution of text, non-homogeneous or complex backgrounds, and false joining of adjacent characters into single components.

An important step in recognition of video text is the pre-processing of images. There has been a debate on whether to feed color, grayscale or binarized images to the learning algorithm. This paper presents an analytical study investigating the impact of pre-processing on recognition of cursive video text. More specifically, we employ a combination of convolutional neural networks (CNNs) and bidirectional long short-term memory (LSTM) networks for recognition of cursive text. The hybrid CNN-LSTM combination is fed either directly with grayscale text line images or with images in which the text has been segmented from the background. For segmentation of text and background, a global as well as a number of local thresholding techniques are studied. The study reveals that appropriately pre-processing the images prior to training the models results in enhanced recognition rates. We employ video text lines in cursive Urdu as a case study, but the findings can be generalized to other cursive scripts as well.

This paper is organized as follows. The next section presents an overview of video text recognition in cursive scripts followed by an introduction to our dataset in Sect. 3. Section 4 introduces the pre-processing and the recognition techniques employed while Sect. 5 details the experimental setup, the reported results and the accompanying discussion. Finally, we provide our concluding remarks in Sect. 6 of the paper.

2 Related Work

Recognition of text, commonly known as Optical Character Recognition (OCR), has remained one of the most investigated pattern classification problems. Thanks to research endeavors spanning decades, state-of-the-art recognition systems have been developed for printed documents [12], handwriting [19], natural text in scene images [21] and artificial (caption) text appearing in videos [3].

As discussed earlier, cursive text offers a more challenging recognition problem due to the complexity of the script. Recognition techniques for cursive scripts are generally divided into holistic (segmentation-free) and analytical (segmentation-based) methods. Holistic techniques employ partial words as recognition units, while analytical techniques aim to recognize individual characters which are segmented either explicitly or implicitly [8]. Implicit segmentation refers to feeding text lines and ground truth transcriptions to the learning algorithm, which then learns the character shapes and segmentation points on its own [14, 15]. Such techniques have remained a popular choice among researchers, as explicit segmentation of cursive text into characters is highly challenging.

Among notable studies on recognition of cursive video text, Zayene et al. [31] employ long short-term memory networks (LSTMs) to recognize Arabic text in video frames. The technique was evaluated on two datasets, ALIF [28] and ACTiV [30], and high recognition rates were reported. Halima et al. [7] present a system to localize and recognize Arabic text in video frames. The recognition engine exploits a set of statistical features with a nearest neighbor classifier to recognize partial words. In another comprehensive study [9], an end-to-end system is presented for recognition of Arabic text in videos and natural scenes. The system relies on a combination of CNNs and RNNs for recognition of text. Likewise, Yousfi et al. [28] employ CNNs with deep auto-encoders to compute features at multiple scales. The feature sequences are fed to a recurrent neural network for prediction of the transcription. The technique is evaluated on a collection of videos from a number of Arabic TV channels and reports promising recognition rates. In another interesting work [29], the authors focus on improving the performance of LSTM based recognition engines by employing recurrent connectionist language models. An improvement of 16% in word recognition rates over baseline methods is demonstrated by the introduction of the proposed models.

With reference to Urdu text, a number of robust techniques have been presented for recognition of printed document images. The holistic recognition techniques reported in the literature mostly employ hidden Markov models to recognize partial words (ligatures) [2]. A major issue in holistic techniques is the large number (in the thousands) of ligature classes to be recognized. An effective strategy is to separate the main body of ligatures from dots and diacritics to reduce the number of classes, as many partial words share the same main body and differ only in the number and positioning of dots and diacritics. After recognition, dots are re-associated with their parent main body component as a post-processing step [5].

Similar to Arabic text recognition, implicit segmentation based techniques have been widely employed for recognition of Urdu text as well. These techniques typically employ LSTMs with a Connectionist Temporal Classification (CTC) layer. The network is fed either with raw pixels [16] or with feature sequences extracted by a CNN [15]. The literature is relatively limited when it comes to recognition of Urdu text appearing in videos. A recent work by Tayyab et al. [25] employs bidirectional LSTMs to recognize news tickers on various Urdu news channels. In another related work, Hayat et al. [8] present a holistic technique to recognize Urdu ligatures in video text. Ligature images are first grouped into clusters, and convolutional neural networks are trained to recognize the ligature classes. Though the system reports very high recognition rates, the number of ligature classes considered in the study is fairly small (a few hundred only).

A critical review of the work on recognition of cursive scripts reveals that implicit segmentation based techniques have proven more effective than holistic recognition techniques. Such techniques not only avoid extraction of partial words from lines of text, they also do not require the training data to be organized into clusters of partial words. The text lines and corresponding ground truth are fed directly to the learning algorithm, making the technique simple yet effective. In our study, we have therefore also chosen an implicit segmentation based recognition technique. Prior to presenting its details, we first introduce the dataset considered in our work in the next section.

3 Dataset

Benchmark datasets have been developed (and labeled) for recognition of printed Urdu text. Two well-known examples are the CLE [1] and UPTI [20] datasets, which have been widely employed for evaluation of recognition systems targeting printed Urdu text. From the viewpoint of video text, datasets like ALIF [28] and ACTiV [30] have been developed for cursive Arabic text. A small dataset of video frames containing artificial text in Urdu is presented in [24]; the number of text lines in that dataset, however, is fairly limited and cannot be employed to train deep neural networks. As a part of our endeavors towards the development of a comprehensive video retrieval engine, we are in the process of developing and labeling a large database of video frames with occurrences of artificial text. We have collected more than 400 h of video from various news channels, and more than 10,000 video frames have been labeled so far. The ground truth information associated with each frame includes some meta data, the location of each text line (bounding box), script information and the actual transcription of the text. The dataset will be made publicly available once the labeling process is complete. More details on the dataset and the labeling process can be found in our previous work [13].

For the present study, we consider 12,000 text lines which are extracted from the video frames using ground truth information. Since the focus of this study is on recognition and not on localization, the localization information in the ground truth file is used to extract the text lines. For all experiments, 10,000 text lines are used in the training set and 2,000 in the test set. Sample text regions extracted from the frames are illustrated in Fig. 2.

Fig. 2. Sample text lines extracted from video frames

4 Methods

This section presents the details of the pre-processing and recognition techniques employed in our work. As discussed earlier, the key objective of this study is to investigate the impact of pre-processing on recognition performance. The recognition engine is fed with grayscale images as well as with images in which the text has been extracted using various thresholding techniques. For recognition, a hybrid model of convolutional and long short-term memory networks is employed. Details on pre-processing and recognition are presented in the following.

4.1 Pre-Processing

Presenting the recognition engine with appropriate data is an important step that directly affects the recognition performance. Since the key task of the learning algorithm is to learn character shapes and boundaries, color information is generally discarded, as it may lead the algorithm to learn colors rather than shapes. This idea is strengthened by the fact that text is readable by humans without color information. Consequently, as a first step, all images are converted to grayscale. A major issue affecting recognition performance is the non-homogeneous background of text, as can be seen in Fig. 2. Furthermore, the polarity of text can be bright text on a dark background or dark text on a bright background; the learning algorithm must be provided with images having a consistent text polarity. It is also known that binarizing the text lines can be useful, but imperfect binarization can deteriorate the recognition performance. These and similar issues served as the motivation for our investigations in the current study.

We start by identifying the polarity of the text. For all experiments, we assume the convention of dark text on a bright background. To detect the polarity of a given text line, we apply the Canny edge detector to the grayscale image to detect blobs which correspond to (approximate) text regions. Region filling is applied to these blobs, and the resulting image is used as a mask to extract the corresponding regions from the original grayscale image. We then compute the median gray value of the extracted blobs (\(Med_{text}\)) as well as the median gray value of the background (all pixels which do not belong to any blob), \(Med_{back}\). If \(Med_{text} < Med_{back}\), we have dark text on a bright background and the polarity agrees with our assumed convention. If, on the other hand, \(Med_{text} > Med_{back}\), the image contains bright text on a dark background and its polarity is reversed prior to any further processing. The process is summarized in Fig. 3.
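As an illustration, the following minimal Python sketch implements this polarity check with OpenCV. The Canny thresholds, the morphological closing step and the contour-based region filling are our assumptions; the paper specifies only edge detection, region filling and the comparison of median gray values.

```python
import cv2
import numpy as np

def normalize_polarity(gray, canny_low=50, canny_high=150):
    """Enforce the dark-text-on-bright-background convention.

    canny_low/canny_high and the 3x3 closing kernel are assumed
    parameters, not values reported in the paper.
    """
    edges = cv2.Canny(gray, canny_low, canny_high)
    # close small gaps so that text strokes form connected blobs
    closed = cv2.morphologyEx(edges, cv2.MORPH_CLOSE,
                              np.ones((3, 3), np.uint8))
    # fill the blob interiors to obtain an approximate text mask
    contours, _ = cv2.findContours(closed, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    mask = np.zeros_like(gray)
    cv2.drawContours(mask, contours, -1, 255, thickness=cv2.FILLED)
    if not mask.any():
        return gray  # no text blobs detected; leave the image as-is
    med_text = np.median(gray[mask == 255])
    med_back = np.median(gray[mask == 0])
    # bright text on dark background: invert to match the convention
    return 255 - gray if med_text > med_back else gray
```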

Fig. 3. Identification of polarity of text (a): Original image (b): Grayscale image (c): Text blobs (d): Filled text blobs serving as a mask to extract corresponding blobs from the gray image; in this case \(Med_{text} < Med_{back}\), hence the image is not inverted.

Once all text lines have the same polarity, they can either be fed directly to the recognition module or first binarized to extract only the textual information. For binarization, we investigated a number of thresholding techniques, including Otsu's global thresholding method [18] as well as a number of local thresholding algorithms. The local thresholding algorithms are adaptive techniques in which the threshold of each pixel is computed as a function of the neighboring pixels. Most of these algorithms are inspired by the classical Niblack thresholding [17], where the threshold is computed from the mean and standard deviation of the gray values in the neighborhood of a reference pixel. Other algorithms investigated in our study include Sauvola's [22], Feng's [6] and Wolf's [26] thresholding algorithms. Prior to binarizing the images, we also apply a smoothing (median) filter to each text line to suppress noisy patterns. The results of applying the various thresholding techniques to a sample text line image are illustrated in Fig. 4. From a subjective analysis of these results, Wolf's algorithm, which was specifically proposed for low-resolution video text, seems to outperform the other techniques. Nevertheless, it is hard to generalize from visual inspection of a few sample images, and the recognition rates on images produced by each technique are a better indicator of its effectiveness.
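For reference, the sketch below applies the compared thresholding methods to a polarity-normalized grayscale line. Otsu's, Niblack's and Sauvola's thresholds come from scikit-image; Wolf's threshold is implemented from its published formula, \(T = m - k\,(1 - s/R)(m - M)\), where \(m\) and \(s\) are the local mean and standard deviation, \(R\) the maximum local standard deviation and \(M\) the minimum gray level. The window sizes and \(k\) values are our assumptions rather than the settings used in the experiments, and Feng's method (which follows the same local pattern) is omitted for brevity.

```python
import numpy as np
from scipy.ndimage import median_filter, uniform_filter
from skimage.filters import threshold_otsu, threshold_niblack, threshold_sauvola

def wolf_threshold(gray, window=25, k=0.5):
    """Wolf-Jolion threshold map: T = m - k * (1 - s/R) * (m - M)."""
    g = gray.astype(np.float64)
    mean = uniform_filter(g, window)
    sq_mean = uniform_filter(g * g, window)
    std = np.sqrt(np.maximum(sq_mean - mean ** 2, 0.0))
    R = max(std.max(), 1e-6)   # maximum local standard deviation
    M = g.min()                # minimum gray level of the image
    return mean - k * (1.0 - std / R) * (mean - M)

def binarize_variants(gray, window=25):
    """Binarize one text line with each thresholding method compared here."""
    smoothed = median_filter(gray, size=3)  # suppress noisy patterns first
    return {
        "otsu": smoothed > threshold_otsu(smoothed),
        "niblack": smoothed > threshold_niblack(smoothed, window_size=window),
        "sauvola": smoothed > threshold_sauvola(smoothed, window_size=window),
        "wolf": smoothed > wolf_threshold(smoothed, window=window),
    }
```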

Fig. 4. Binarization results on a sample text line (a): Niblack (b): Otsu's global thresholding (c): Feng's algorithm (d): Sauvola (e): Wolf's algorithm

4.2 Text Recognition

As discussed earlier, we employ an implicit segmentation based recognition technique that does not require explicitly segmenting partial words into characters. More specifically, we employ convolutional neural networks to extract feature sequences from text line images. These sequences, along with the ground truth transcriptions, are fed to a bidirectional long short-term memory network. This hybrid architecture is often referred to as C-RNN in the literature [4, 10] and has shown promising results on recognition problems [23]. A CTC layer is employed to align the ground truth transcription with the corresponding feature sequence. Figure 5 presents an overview of the recognition engine, illustrating the C-RNN with the CTC layer.
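A minimal PyTorch sketch of such a C-RNN is shown below. The layer sizes, pooling scheme and two-layer BiLSTM are illustrative assumptions; the paper does not report the exact architecture. The convolutional stack collapses the image height while preserving width as the time axis, and the CTC loss aligns the per-timestep outputs with the unsegmented transcription.

```python
import torch
import torch.nn as nn

class CRNN(nn.Module):
    """Illustrative C-RNN: CNN feature extractor + BiLSTM + CTC head."""
    def __init__(self, num_classes, img_height=32):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1), (2, 1)),  # halve height only, keep width
        )
        feat_h = img_height // 8            # height after the three pools
        self.rnn = nn.LSTM(256 * feat_h, 256, num_layers=2,
                           bidirectional=True, batch_first=True)
        self.fc = nn.Linear(512, num_classes + 1)  # +1 for the CTC blank

    def forward(self, x):                   # x: (batch, 1, H, W)
        f = self.cnn(x)                     # (batch, C, H', W')
        b, c, h, w = f.shape
        f = f.permute(0, 3, 1, 2).reshape(b, w, c * h)  # width = time axis
        out, _ = self.rnn(f)
        return self.fc(out)                 # (batch, T, num_classes + 1)

# Training step (schematic): nn.CTCLoss expects (T, batch, C) log-probs.
# loss = nn.CTCLoss(blank=num_classes)(
#     logits.log_softmax(2).permute(1, 0, 2),
#     targets, input_lengths, target_lengths)
```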

Fig. 5. Recognition of text line images: C-RNN with CTC layer

5 Experiments and Results

This section presents the recognition results obtained with the various binarization techniques. As mentioned earlier, for all experiments, 10,000 text line images are used for training and 2,000 for testing. The recognition engine outputs the predicted transcription of a query text line. To quantify the recognition performance, we compute the Levenshtein distance between the predicted and ground truth transcriptions and derive character recognition rates from it.
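A character recognition rate of this kind can be computed as one minus the normalized edit distance. The short sketch below illustrates the computation; normalizing by the ground-truth length is our assumption, as the paper does not state its exact normalization.

```python
def char_recognition_rate(predicted: str, ground_truth: str) -> float:
    """CRR = 1 - Levenshtein(predicted, ground_truth) / len(ground_truth)."""
    m, n = len(predicted), len(ground_truth)
    row = list(range(n + 1))                # distances for the empty prefix
    for i in range(1, m + 1):
        prev, row[0] = row[0], i            # prev holds d[i-1][j-1]
        for j in range(1, n + 1):
            cur = row[j]
            row[j] = min(row[j] + 1,        # deletion
                         row[j - 1] + 1,    # insertion
                         prev + (predicted[i - 1] != ground_truth[j - 1]))
            prev = cur
    return max(0.0, 1.0 - row[n] / max(n, 1))
```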

For a meaningful comparison of the binarization techniques, we employ the same network architecture and the same set of hyper-parameters in all experiments. In the first experiment, we compute the recognition rates directly on the grayscale text line images and achieve a character recognition rate of 83.48% on the 2,000 test lines. Subsequently, we evaluate the recognition engine on lines binarized with each of the thresholding techniques discussed earlier. The resulting recognition rates are summarized in Table 1.

A number of interesting observations can be made from the reported recognition rates. The grayscale text line images yield higher recognition rates than text lines binarized using Niblack's and Otsu's thresholding algorithms. This observation is consistent with our initial assessment of the binarization algorithms: Niblack's binarization generally introduces substantial noise in the binarized images, while global thresholding fails when text images have non-homogeneous backgrounds. The performance of Feng's and Sauvola's binarization methods is similar, at 86.41% and 85.75% respectively. Text lines binarized using Wolf's algorithm yield the highest recognition rate of 93.48%. This is also consistent with the subjective analysis, in which Wolf's algorithm produced relatively cleaner binarized images. Furthermore, the algorithm was specifically designed for binarization of video text (mostly in French), and the current results validate its superiority over the other techniques for recognition of cursive text as well.

To provide insight into the recognition errors, we illustrate in Fig. 6 the predicted transcriptions of a sample text line binarized using the various thresholding techniques. Although all algorithms produce recognition errors, it is interesting to note that due to noisy binarization in the case of Niblack's and Otsu's thresholding, the predicted and actual characters are very different. For the other techniques, Wolf's algorithm for instance, the morphological similarity between the predicted and actual characters is high.

Table 1. Summary of Recognition Rates on Grayscale and Binarized Images
Fig. 6. Recognition errors in predicted transcriptions of a sample text line

We also carried out a series of experiments to study the impact of the training set size on recognition performance. Keeping the test set fixed at 2,000 lines of text, we varied the number of training text lines from 2,000 to 10,000. The corresponding recognition rates are illustrated in Fig. 7, where it can be seen that the recognition rate begins to stabilize at around 7,000 lines of text, a manageable size for such applications.

Fig. 7. Recognition rates as a function of the size of the training data

It can be concluded from the realized recognition rates that pre-processing is a critical step with a significant impact on recognition performance. Binarizing the images appropriately resulted in an increase of 10 percentage points in recognition rate (from 83.48% to 93.48%) with respect to the grayscale images. While deep learning based recognition systems represent the state of the art, it is important to feed them appropriately pre-processed data to achieve performance on par with the expectations of commercial applications.

6 Conclusion

This paper investigated the impact of pre-processing on recognition of cursive video text. We employed Urdu text appearing in video frames as a case study, but the findings can be generalized to other cursive scripts as well. The recognition engine comprises a combination of convolutional and long short-term memory networks followed by a connectionist temporal classification layer. The network is trained on text line images in grayscale as well as on images in which text is segmented from the background using various binarization techniques. Experiments on a dataset of 12,000 text line images revealed that appropriate pre-processing of text lines significantly enhances recognition performance.

In our further work on this problem, we intend to continue the labeling process to develop and publicly release a large dataset of 30,000 labeled video frames. The present study targeted the recognition part only, which we plan to integrate with the text localization module; this in turn will allow the development of a comprehensive textual content based retrieval system. From the viewpoint of recognition, similar to pre-processing, we also aim to investigate the impact of data augmentation on recognition performance.