1 Introduction

Separation of the text foreground from the page background is an important processing step for the analysis of various kinds of documents, e.g. historical, administrative, scanned and camera-captured documents. This separation enables various document analysis algorithms (such as character recognition, or the localization of stamps, logos and signatures) to focus only on the written text without the disturbing background noise. When text and graphics lie in separate regions of the page, localizing them is easier than when the text is superimposed on background graphics [1].

In this paper, we focus on the proper extraction of all text components from French university diplomas, which contain a complex background. The objective is to secure the university diplomas and to reduce the possibilities of fraud. The core idea is to encode a unique and secret number in the diploma while printing it. This encoding is done by changing the shape of certain characters, each modified character shape being associated with a specific digit (an association known only to a few specific people of the university). Hence, to later check the authenticity of a scanned student's diploma, we need a proper and undeformed extraction of the text characters, which is required for decoding the secret and unique number associated with each diploma.

All French university diplomas (around 300 thousand per year) are printed on the same decorated and authenticated thick paper/parchment (in French, parchemin), fabricated only and strictly by the National Printing House (in French, Imprimerie Nationale). Hence, every French university diploma has the same background and only the foreground text changes between universities/institutes (moreover, scanned diplomas vary further due to scanning effects). The text fonts, sizes and styles also differ across universities/institutes, and in every diploma each individual student's information, e.g. name, date of birth and place of birth, changes as well. All these variations together make the binarization process harder. The complex background is strongly superimposed with the foreground textual components, which makes it difficult to separate the text from the background (see Fig. 2a for an example of a diploma). This problem can also be seen as the separation of a graphical background from the textual foreground. It also closely resembles the binarization of degraded historical documents, where degradations include a non-uniform background, stains, faded ink, ink bleeding through the page and uneven illumination [2]. State-of-the-art text/graphics separation approaches do not perform well in our case because, unlike common text/graphics separation data such as maps and floor plans [1], the text in diploma images is fully superimposed on the decorated colored background.

Several categories of binarization techniques exist; they are often evaluated on the dataset of the popular document image binarization contest (DIBCO) [11]. However, these categories of approaches have several constraints. Many common binarization approaches are based on the calculation of local or global thresholds and image statistics, which does not work in our case (see the experimental results in Sect. 4). As the contribution of this paper, we therefore propose a new algorithm for separating text components from a graphical background. The algorithm starts with a structured forest based fast gradient image formation, followed by a textual region selection using image masking for an initial filtering of the textual components. Fuzzy C-Means clustering is then applied to separate textual and non-textual components. As the clustering may deform some textual components (characters), we also propose a character reconstruction technique based on a local window based thresholding approach and the Sauvola binarization algorithm [12]. This technique correctly classifies some additional pixels as text pixels, which helps to reconstruct/recover the missing character pixels. To the best of our knowledge, this is the first attempt at removing the background to obtain the text from French university diplomas.

The remainder of this paper is organized as follows. First, Sect. 2 summarizes the work related to text/graphics segmentation in general. Then, Sect. 3 provides an overview of the method proposed. Subsequently, experimental results are described and analyzed in Sect. 4. Finally, Sect. 5 concludes the paper and gives an overview of future work.

Fig. 1. The proposed algorithm architecture. The portions in bold are the main steps of the algorithm, while the two non-bold blocks are the auxiliary steps needed for the “reconstruction of damaged characters” block.

2 Related Work

By definition, text binarization means labeling each pixel of the image as text or background, which closely resembles our problem. Since no previous work addresses our specific task, we outline here some related work from the domain of (mainly historical) document image binarization. The existing binarization techniques in the literature can be grouped into two principal categories: learning-free and learning-based approaches. Several binarization techniques have been proposed in the past decade, but the recent trend in document binarization is toward machine learning based methods, mostly relying on deep learning.

Most deep learning based image binarization algorithms use convolutional neural networks (CNN), e.g. Tensmeyer and Martinez [13], or variations thereof, such as the deep supervised network (DSN) approach proposed by Vo et al. [14]. Because such approaches require a sufficient amount of training data, and no ground truth exists in our case, they are not useful here. Moreover, learning-free approaches [8] have recently shown high potential and accuracy comparable to learning-based approaches. In [2], a local binarization method based on thresholding with dynamic and flexible windows is proposed. Jia et al. [8] proposed an approach based on the structural symmetry of pixels (SSP) of text strokes. Combining SSP with Fuzzy C-Means (FCM) clustering (the FRFCM technique [7]), Mondal et al. [10] also proposed an improved and fast binarization technique for historical document images.

While deep learning methods receive more and more attention in the community, FCM [7] based models remain among the most popular methods for image segmentation. We avoided using any learning-based method (e.g. deep neural networks) for background removal due to the unavailability of a pixel-level ground truth (GT); it would also be highly cumbersome and expensive to generate such a GT for our high resolution experimental dataset. FCM methods rely on fuzzy set theory, which introduces fuzziness in the belongingness of each image pixel to a certain class. Fuzzy clustering is superior to hard clustering, as it tolerates ambiguity better and retains more of the original image information. One such clustering technique, FRFCM [7], is used as one of the main processing steps of our proposed algorithm, so we briefly review FCM techniques here to justify this choice. Since plain FCM only considers gray-level information and ignores spatial information, it fails to segment images with complex texture and background or images corrupted by noise. The FCM algorithm with spatial constraint (FCM_S) [7] was therefore proposed, which incorporates spatial information into the objective function. However, FCM_S is time consuming because the spatial neighborhood term is computed in each iteration. To reduce the execution time, two modified versions, FCM_S1 and FCM_S2 [7], were proposed; they employ average filtering and median filtering, respectively, to obtain the spatial neighborhood information in advance. However, neither FCM_S1 nor FCM_S2 is robust to Gaussian noise or to unknown noise intensities.

The Enhanced FCM (EnFCM) [7] algorithm is attractive from the viewpoint of computational time, as it performs clustering on the gray-level histogram of a summed image instead of on individual pixels. However, the segmentation result of EnFCM is only comparable to that of FCM_S. To improve the segmentation results, the Fast Generalized FCM (FGFCM) [7] was proposed. It introduces a new factor as a local similarity measure, which guarantees both noise immunity and detail preservation, removes the empirical parameter \(\alpha \) of EnFCM, and also performs clustering on gray-level histograms. However, FGFCM needs more parameter tuning than EnFCM. The robust Fuzzy Local Information C-Means clustering algorithm (FLICM), introduced in [7], is free from parameter selection: it replaces the parameter \(\alpha \) of EnFCM by incorporating a novel fuzzy factor into the objective function to guarantee noise immunity and image detail preservation. Although FLICM improves the segmentation performance, its fixed spatial distance is not robust to the varying local information of images. Hence, we use a significantly faster and more robust algorithm based on morphological reconstruction and membership filtering (FRFCM) [7] for image segmentation.

The aforementioned state-of-the-art binarization methods were developed for the binarization of historical document images; testing some of them on our data produced unsatisfactory results. In the following section, we propose a new method for separating the textual foreground from the textured background.

3 Proposed Method

In this section, the proposed algorithm is explained; its overall architecture is shown in Fig. 1. First, the structured forest based gradient image formation technique is applied to the image \(\mathcal {I}(x,y)\) (see Fig. 2a).

3.1 Structured Forest Based Fast Gradient Image Formation

Dollár et al. [4] proposed a real-time gradient computation for edge detection that is faster and more robust to the presence of texture than previous state-of-the-art approaches. The gradient computation (see Fig. 2b) exploits the structure present in local image patches and learns an edge detector that is both accurate and computationally fast.
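
For illustration only, a minimal sketch of this step is given below, assuming the structured edge detector available in OpenCV's ximgproc module (one public implementation of [4]) and a pretrained model file whose path is a placeholder; the paper does not state which implementation was used.

```cpp
#include <opencv2/opencv.hpp>
#include <opencv2/ximgproc.hpp>
#include <string>

// Sketch: edge-strength (gradient) map via structured forests [4], assuming the
// OpenCV ximgproc implementation and a pretrained model file (path is a placeholder).
cv::Mat structuredGradient(const cv::Mat& bgr, const std::string& modelPath = "model.yml.gz") {
    cv::Mat rgb, src;
    cv::cvtColor(bgr, rgb, cv::COLOR_BGR2RGB);      // detector expects RGB
    rgb.convertTo(src, CV_32FC3, 1.0 / 255.0);      // float image in [0, 1]
    auto detector = cv::ximgproc::createStructuredEdgeDetection(modelPath);
    cv::Mat edges;                                   // per-pixel edge strength in [0, 1]
    detector->detectEdges(src, edges);
    return edges;
}
```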

3.2 Anisotropic Diffusion Filtering

An anisotropic diffusion filter [6] is applied to the gradient image (see Sect. 3.1) to reduce image noise (due to scanning) without removing significant image content, e.g. edges, lines and other details (see Fig. 2c).
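
A possible realization using OpenCV's anisotropic diffusion is sketched below; the filter parameters (alpha, K, number of iterations) are illustrative defaults and not values reported in the paper.

```cpp
#include <opencv2/opencv.hpp>
#include <opencv2/ximgproc.hpp>

// Sketch: anisotropic diffusion on the gradient image. OpenCV's implementation
// expects an 8-bit 3-channel input, so the single-channel gradient map is
// converted first. alpha, K and the iteration count are illustrative only.
cv::Mat diffuseGradient(const cv::Mat& gradientFloat) {
    cv::Mat g8u, g3c, filtered;
    gradientFloat.convertTo(g8u, CV_8U, 255.0);
    cv::cvtColor(g8u, g3c, cv::COLOR_GRAY2BGR);
    cv::ximgproc::anisotropicDiffusion(g3c, filtered, /*alpha=*/0.1f, /*K=*/20.0f, /*niters=*/10);
    cv::cvtColor(filtered, filtered, cv::COLOR_BGR2GRAY);
    return filtered;
}
```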

3.3 Selection of Textual Regions from the Original Image

After filtering the image, edges are detected with the Canny edge detector (see Fig. 2d). The method of Dollár et al. [4] (see Sect. 3.1) only provides an improved gradient image, so the Canny edge detection algorithm is applied to generate binary edges. The high and low Canny thresholds are computed automatically as \(\phi \) (Otsu's threshold value) and \(0.5 \times \phi \), respectively; the idea is to avoid hand-crafted thresholds.
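
A compact sketch of this automatic threshold selection, assuming OpenCV and an 8-bit filtered image from Sect. 3.2: Otsu's value is obtained from cv::threshold and reused as the high Canny threshold.

```cpp
#include <opencv2/opencv.hpp>

// Sketch: Canny with automatically derived thresholds -- the high threshold is
// Otsu's value phi computed on the filtered gray image, the low one is 0.5 * phi.
cv::Mat autoCanny(const cv::Mat& gray) {
    cv::Mat otsuDummy, edges;
    double phi = cv::threshold(gray, otsuDummy, 0, 255, cv::THRESH_BINARY | cv::THRESH_OTSU);
    cv::Canny(gray, edges, 0.5 * phi, phi);
    return edges;
}
```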

The edge image is then dilated with a \(7 \times 7\) rectangular kernel (Footnote 1) to connect the broken edges caused by the Canny algorithm (see Fig. 2e). The resulting image contains holes and gaps inside the character shapes (see Fig. 2e). These holes/gaps are filled by applying the well known flood fill (Footnote 2) based hole filling approach (see Fig. 2f). Let us denote this hole-filled image as \(\mathcal {H}(x,y)\). The original gray-scale pixel values corresponding to the foreground of this hole-filled image are then copied into a blank image of the same size as the original, initialized to 255 (denoted \(\mathcal {T}(x,y)\); see Fig. 2g). This amounts to applying a mask (Footnote 3) derived from \(\mathcal {H}(x,y)\). A clustering technique is then applied to the masked image \(\mathcal {T}(x,y)\) to classify its pixels into text and background pixels.
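
The dilation, hole filling and masking described above can be sketched as follows (OpenCV assumed); the flood fill is seeded at a border pixel, which presumes that the image border belongs to the background.

```cpp
#include <opencv2/opencv.hpp>

// Sketch: dilate the Canny edges with a 7x7 rectangular kernel, fill the holes
// inside character shapes via flood fill from the border, and use the result
// H(x,y) as a mask to keep the original gray values (background set to 255).
cv::Mat selectTextualRegions(const cv::Mat& gray, const cv::Mat& edges) {
    cv::Mat kernel = cv::getStructuringElement(cv::MORPH_RECT, cv::Size(7, 7));
    cv::Mat dilated;
    cv::dilate(edges, dilated, kernel);

    cv::Mat flooded = dilated.clone();                 // flood fill the outside region
    cv::floodFill(flooded, cv::Point(0, 0), cv::Scalar(255));
    cv::Mat holes, H;
    cv::bitwise_not(flooded, holes);                   // pixels not reached = interior holes
    cv::bitwise_or(dilated, holes, H);                 // hole-filled image H(x,y)

    cv::Mat T(gray.size(), CV_8UC1, cv::Scalar(255));  // blank image initialized to 255
    gray.copyTo(T, H);                                 // keep gray values under the mask
    return T;
}
```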

Fig. 2. Various preprocessing steps: (a) original gray image, (b) gradient image obtained by [4], (c) anisotropic filtered image, (d) edges detected by Canny, (e) dilated image after Canny edge detection, (f) hole-filled image after dilation, (g) selected textual regions \(\mathcal {T}(x,y)\), (h) only sure text pixels after clustering, (i) sure and confused pixels after clustering, (j) character deformation after clustering (sure text pixels only).

3.4 Fuzzy C-Means (FRFCM) Clustering Algorithm

Following the state of the art on FCM clustering reviewed in Sect. 2, Lei et al. [7] proposed the FRFCM algorithm, which has a low computational cost and achieves good segmentation results with high precision on various types of images. It employs morphological reconstruction (MR) to smooth the images in order to improve noise immunity and image detail preservation. The results obtained by the various aforementioned clustering techniques on a small cropped image of \(822 \times 674\) pixels are shown in Table 1; FRFCM has an impressive computational time.

FRFCM is therefore faster than the other improved FCM algorithms mentioned above. Setting the number of clusters to 3 (sure text pixels, confused text pixels and background pixels), we apply FRFCM clustering on \(\mathcal {T}(x,y)\) (i.e. on the image shown in Fig. 2g), which gives a clustered image (see Fig. 2i). In Fig. 2i, two darker intensity levels are visible besides white, which represents the background. The darkest level corresponds to the sure text pixels (the image formed by these pixels alone is shown in Fig. 2h), while the other level corresponds to the confused pixels. The following techniques are applied to qualify these confused pixels as text or background.
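
FRFCM itself is not part of standard libraries, so no implementation of it is shown here. Assuming any three-class clustering has produced a label map for \(\mathcal {T}(x,y)\), the split into sure text, confused and background pixels by cluster brightness can be sketched as follows; function and variable names are illustrative.

```cpp
#include <algorithm>
#include <array>
#include <opencv2/opencv.hpp>

// Sketch: given a 3-class label map (CV_32S) produced by clustering T(x,y),
// order the clusters by mean gray value -- darkest = sure text, middle =
// confused, brightest = background -- and return the two foreground masks.
void splitClusters(const cv::Mat& T, const cv::Mat& labels,
                   cv::Mat& sureText, cv::Mat& confused) {
    std::array<double, 3> sum{}, cnt{};
    for (int y = 0; y < T.rows; ++y)
        for (int x = 0; x < T.cols; ++x) {
            int c = labels.at<int>(y, x);
            sum[c] += T.at<uchar>(y, x);
            cnt[c] += 1.0;
        }
    std::array<int, 3> order{0, 1, 2};
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return sum[a] / cnt[a] < sum[b] / cnt[b]; });
    sureText = (labels == order[0]);    // darkest cluster (Fig. 2h)
    confused = (labels == order[1]);    // resolved later in Sect. 3.8
}
```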

Some character shapes are deformed by the clustering (see Fig. 2j), but this problem mainly originates from the gradient image formation followed by the Canny edge detection. Due to the low contrast between background and foreground in these specific regions, the gradient image fails to clearly delineate the text, and the subsequent Canny edge detection fails as well. This is shown in Fig. 3a (top left: zoomed portion of the dilated image after Canny, top right: hole-filled image, bottom left: textual region separation, bottom right: clustered image). Note that even before the dilation operation on the Canny image, some portions of the characters were already missing. The following technique is applied to recover the missing portions and to reconstruct the image.

3.5 Sauvola Binarization

We apply Sauvola binarization [12] on the original image at this step of the algorithm (see Fig. 3b). The following thresholding formula is used:

$$\begin{aligned} \begin{aligned} \mathcal {S} = m(1-k(1-\frac{\sigma }{R})) \end{aligned} \end{aligned}$$
(2)

where k is set to 0.2, \(R = 125\) is the gray-level range value, and m and \(\sigma \) represent the mean and the standard deviation of the local image window, respectively.
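
A minimal sketch of Eq. (2) using OpenCV box filters is given below. The local window size is not specified in the text, so the value of 25 pixels is an assumption; k and R follow the paper.

```cpp
#include <opencv2/opencv.hpp>

// Sketch: Sauvola thresholding (Eq. 2). Text (dark) pixels are returned as 255.
// The window size 'win' is an assumed default; k = 0.2 and R = 125 as in the text.
cv::Mat sauvolaBinarize(const cv::Mat& gray, int win = 25, double k = 0.2, double R = 125.0) {
    cv::Mat g;
    gray.convertTo(g, CV_64F);
    cv::Mat mean, meanSq;
    cv::boxFilter(g, mean, CV_64F, cv::Size(win, win));        // local mean m
    cv::boxFilter(g.mul(g), meanSq, CV_64F, cv::Size(win, win));
    cv::Mat var = meanSq - mean.mul(mean);
    cv::Mat stddev;
    cv::sqrt(cv::max(var, 0.0), stddev);                       // local sigma
    cv::Mat factor = 1.0 + k * (stddev / R - 1.0);             // m * (1 - k * (1 - sigma/R))
    cv::Mat thresh = mean.mul(factor);
    cv::Mat bin = (g <= thresh);
    return bin;
}
```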

Table 1. Time required (in seconds) by the different clustering algorithms on a cropped diploma image.

3.6 Basic and Fast Text Line Segmentation

Text lines are roughly segmented based on a horizontal projection of the binarized image obtained from the FRFCM clustering (the sure text pixels). After obtaining the height of each text line, a simple but efficient approach to compute the average of these values is described in Algorithm 1. This technique removes outliers from the set of values, which yields a better average value (Footnote 4). The heights are stored in Arr[items] and the resulting intelligent average is obtained in the variable AvgVal (Footnote 5).

Algorithm 1: calculation of the intelligent average of the text line heights.
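
The listing of Algorithm 1 is not reproduced above, so the following is only a plausible sketch of an outlier-robust average under the assumption that heights far from the median are discarded before averaging; the rejection rule and its 50% tolerance are hypothetical, not the authors' exact procedure.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Hypothetical sketch of the "intelligent average" of text line heights
// (Arr[items] -> AvgVal): heights deviating from the median by more than 50%
// are treated as outliers and excluded from the mean.
double robustAverage(const std::vector<double>& arr) {
    if (arr.empty()) return 0.0;
    std::vector<double> sorted = arr;
    std::sort(sorted.begin(), sorted.end());
    double median = sorted[sorted.size() / 2];
    double sum = 0.0;
    int kept = 0;
    for (double v : arr)
        if (std::fabs(v - median) <= 0.5 * median) { sum += v; ++kept; }
    return kept ? sum / kept : median;
}
```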

3.7 Window Based Image Binarization

We apply a fast window based binarization technique to the original gray image, inspired by the method in [2]. From the previous text line segmentation step, we obtain the first (textStart) and last (textEnd) image rows that contain text. Since we already have the average text line height (AvgVal), we divide the considered region (between textStart and textEnd) into equal-sized horizontal stripes. Each horizontal stripe is then divided into 20 equal parts (Footnote 6), called windows, and each window is binarized using the following equations:

$$\begin{aligned} \begin{aligned} \sigma _{adaptive} = [\frac{\sigma _{W} - \sigma _{min}}{\sigma _{max} - \sigma _{min}}] \times max_{Intensity} \end{aligned} \end{aligned}$$
(3)
$$\begin{aligned} \begin{aligned} \mathcal {T}_{W} = \mathcal {M}_{W} - \frac{\mathcal {M}_{W} \times \sigma _{W}}{(\mathcal {M}_{W} + \sigma _{W}) (\sigma _{adaptive} + \sigma _{W})} \end{aligned} \end{aligned}$$
(4)

where \(\sigma _{adaptive}\) is the gray-level value of the adaptive standard deviation of the window, \(\sigma _{W}\) is the standard deviation of the window, \(\sigma _{min}\) and \(\sigma _{max}\) are the minimum and maximum standard deviations over all the windows (20 here), \(max_{Intensity}\) is the maximum intensity value of the horizontal stripe, \(\mathcal {M}_{W}\) is the mean pixel intensity of the window and \(\mathcal {T}_{W}\) is the threshold of the window. The resulting binarized image is denoted by \(WinImg^{binary}(x,y)\) (see Fig. 3c). Although the simple Sauvola binarization performs comparatively well, it fails to remove the central decorated background (called couronne in French), whereas the window based binarization prominently keeps most of the decorated background portions. Hence, neither can be used directly in our case.
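
The following sketch applies Eqs. (3) and (4) to one horizontal stripe, assuming an 8-bit gray stripe and text pixels returned as 255; the stripe slicing and foreground polarity are our assumptions for illustration.

```cpp
#include <algorithm>
#include <opencv2/opencv.hpp>
#include <vector>

// Sketch: dynamic window binarization of a single horizontal stripe (Eqs. 3-4).
// The stripe is split into 20 windows; sigma_min/sigma_max are taken over these
// windows and max_Intensity over the whole stripe, as described in the text.
cv::Mat binarizeStripe(const cv::Mat& stripe, int nWindows = 20) {
    std::vector<double> means, sigmas;
    int w = stripe.cols / nWindows;
    auto windowRect = [&](int i) {
        int width = (i == nWindows - 1) ? stripe.cols - i * w : w;
        return cv::Rect(i * w, 0, width, stripe.rows);
    };
    for (int i = 0; i < nWindows; ++i) {
        cv::Scalar m, s;
        cv::meanStdDev(stripe(windowRect(i)), m, s);
        means.push_back(m[0]);
        sigmas.push_back(s[0]);
    }
    double sMin = *std::min_element(sigmas.begin(), sigmas.end());
    double sMax = *std::max_element(sigmas.begin(), sigmas.end());
    double maxIntensity;
    cv::minMaxLoc(stripe, nullptr, &maxIntensity);

    cv::Mat out(stripe.size(), CV_8UC1, cv::Scalar(255));
    for (int i = 0; i < nWindows; ++i) {
        double sW = sigmas[i], mW = means[i];
        double sAdaptive = (sMax > sMin)
            ? (sW - sMin) / (sMax - sMin) * maxIntensity : sW;               // Eq. (3)
        double tW = mW - (mW * sW) / ((mW + sW) * (sAdaptive + sW));         // Eq. (4)
        cv::Mat bin = (stripe(windowRect(i)) <= tW);                         // text = 255
        bin.copyTo(out(windowRect(i)));
    }
    return out;
}
```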

Algorithm 2: reconstruction of deformed characters.

3.8 Reconstruction of Deformed Characters

We propose a technique for the reconstruction of deformed characters with the help of the Sauvola and the window based binary images. This reconstruction is required because the character shapes must remain intact for a proper decoding of the secret number, as mentioned in the introduction (see Sect. 1). The obvious text pixels (sure text pixels) obtained from the fuzzy clustering are copied into a new image (called \(\mathcal {K}(x,y)\)). Then, for each confused pixel (line 7 in Algorithm 2) of the fuzzy clustered image, we first check whether the same pixel is a foreground pixel both in the Sauvola image (\(\Upsilon \)) and in the window based binary image \((\chi )\). If it is (line 8 in Algorithm 2), we traverse the neighborhood of this pixel with a window of size \(winSz \times winSz\) and count the pixels that are foreground in both \(\Upsilon (x, y)\) and \(\chi (x, y)\). If this count exceeds \(20 \%\) of the window area (\(winSz \times winSz\)) (line 17 in Algorithm 2), the pixel is marked as a text pixel; otherwise it is marked as a background pixel. The value of winSz is taken as \((\text {stroke width}/2 )\), where the \(\text {stroke width} \) is computed with the approach presented in [3]. The recovered, background-separated image is shown in Fig. 3d and a zoomed portion in Fig. 3e.
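
The logic just described can be sketched as follows, assuming all inputs are 8-bit masks with foreground = 255; the function and variable names are illustrative and the border handling is our assumption.

```cpp
#include <algorithm>
#include <opencv2/opencv.hpp>

// Sketch of the character reconstruction step: 'sure' and 'confused' come from
// the clustering (Sect. 3.4), 'sauvola' is the image of Fig. 3b and 'winBin'
// that of Fig. 3c. winSz is stroke width / 2.
cv::Mat reconstructCharacters(const cv::Mat& sure, const cv::Mat& confused,
                              const cv::Mat& sauvola, const cv::Mat& winBin,
                              int winSz) {
    cv::Mat K = sure.clone();                            // start from the sure text pixels
    int half = std::max(winSz / 2, 1);
    for (int y = 0; y < confused.rows; ++y) {
        for (int x = 0; x < confused.cols; ++x) {
            if (!confused.at<uchar>(y, x)) continue;                       // not a confused pixel
            if (!sauvola.at<uchar>(y, x) || !winBin.at<uchar>(y, x)) continue;
            int count = 0;                               // joint foreground count in the window
            for (int dy = -half; dy <= half; ++dy)
                for (int dx = -half; dx <= half; ++dx) {
                    int yy = y + dy, xx = x + dx;
                    if (yy < 0 || xx < 0 || yy >= confused.rows || xx >= confused.cols) continue;
                    if (sauvola.at<uchar>(yy, xx) && winBin.at<uchar>(yy, xx)) ++count;
                }
            if (count > 0.2 * winSz * winSz)             // more than 20% of the window area
                K.at<uchar>(y, x) = 255;                 // accept as text pixel
        }
    }
    return K;
}
```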

Fig. 3. (a) Reasons for character deformation, (b) the binary image produced by the Sauvola technique, (c) the binary image produced by the dynamic window based technique, (d) the text-separated image after reconstruction, (e) zoomed portion after image reconstruction.

4 Results and Discussion

In this section, we will present and discuss the obtained results on a dataset of university diplomas.

4.1 Dataset

To evaluate the proposed algorithm, we have created the French university diploma dataset described in the following.

4.1.1 Diploma Data Set

A total of 40 images were scanned in color and gray level (denoted Dip_Col and Dip_Gray in Table 2) with a Fujitsu scanner at 300 DPI (20 images) and 600 DPI (20 images); each image is approximately \(7000 \times 5000\) pixels.

4.1.2 Ground Truth Preparation

Since no dataset was previously available, we created one together with a corresponding GT. Because of the large image size, creating the GT fully manually is cumbersome, so we adopted a semi-manual approach in which the range of gray-level thresholds for the foreground/text pixels is selected manually (Footnote 7). The obtained images are then corrected manually with the GIMP software (Footnote 8).

Table 2. Results on the university diploma dataset. The best results of each metric are highlighted in bold.
Fig. 4. (a) The issues with the prepared GT, (b) top: one example of a GT image from the DIBCO dataset, bottom: a clustered image (sure and confused text pixels), (c) top: only the sure text pixels from the clustered image, bottom: after the image reconstruction.

4.2 Evaluation Protocol

We use the F-Measure (FM), the Peak Signal to Noise Ratio (PSNR) and the Distance Reciprocal Distortion metric (DRD) as evaluation metrics (the same as in the DIBCO competition [11]). Table 2 shows that the performance of our technique is quite promising, with an average accuracy (F-Measure) of \(93\%\) over the four parts of the complete dataset. The accuracies could be further improved by a better preparation of the GT, mainly by a more thorough removal of background pixels, since our algorithm classifies these background pixels correctly while the GT still contains some of them. As Fig. 4a shows, the GTs are not perfectly cleaned/corrected, which affects the statistical results. The proposed technique performs better than the classical binarization techniques (e.g. Niblack, Sauvola and Wolf-Jolion [11]). Among these three classical techniques, Sauvola binarization performs comparatively better than the others, but, as mentioned before, these classical techniques are mainly unable to remove the central decorated background (the couronne) (see the Sauvola binarization results in Fig. 3b and 5a).
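
For reference, a minimal sketch of two of these metrics (FM and PSNR) as defined for binary images in the DIBCO protocol is given below; DRD is omitted for brevity, and the assumption is that both result and GT are 8-bit masks with text = 255.

```cpp
#include <cmath>
#include <cstdio>
#include <limits>
#include <opencv2/opencv.hpp>

// Sketch: F-Measure and PSNR between a binarization result and its GT,
// both given as 8-bit masks with text pixels set to 255.
void evaluate(const cv::Mat& result, const cv::Mat& gt) {
    cv::Mat res = result > 0, ref = gt > 0;
    cv::Mat notRes, notRef;
    cv::bitwise_not(res, notRes);
    cv::bitwise_not(ref, notRef);
    double tp = cv::countNonZero(res & ref);
    double fp = cv::countNonZero(res & notRef);
    double fn = cv::countNonZero(notRes & ref);
    double precision = tp / (tp + fp), recall = tp / (tp + fn);
    double fMeasure = 2.0 * precision * recall / (precision + recall);

    // PSNR over the binary images (per-pixel difference is 0 or 1, C = 1).
    double mse = cv::countNonZero(res != ref) / static_cast<double>(res.total());
    double psnr = (mse > 0.0) ? 10.0 * std::log10(1.0 / mse)
                              : std::numeric_limits<double>::infinity();

    std::printf("FM = %.2f%%  PSNR = %.2f dB\n", 100.0 * fMeasure, psnr);
}
```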

We also compared the proposed approach with more recent state-of-the-art binarization techniques. A dynamic window based local thresholding approach was proposed by Bataineh et al. [2]; its comparative results shown in Table 2 are not very promising, and Fig. 3c shows that it cannot properly remove the decorated background graphics. Our technique also outperforms the well-known binarization technique of Gatos et al. [5], whose statistical accuracy is close to that of Sauvola [11] (see Table 2).

Using the structural symmetry of pixels (SSP), we recently proposed a Fuzzy C-Means clustering based binarization technique [10]. Although its accuracy in Table 2 is good, our new method outperforms it by a reasonable margin. As shown in Fig. 5b, the previous approach misses character pixels, which deforms the character shapes, and it is also unable to properly remove the background couronne. Another SSP based technique was proposed by Jia et al. [9], which reports some of the best accuracies on the DIBCO datasets [11]. This technique also fails to properly remove the decorated background from the diplomas (Footnote 9) (see Fig. 5c and Table 2). Our algorithm outperforms this technique as well, by a comfortable margin.

Table 3. Time required (in seconds) by the different algorithms on a diploma image of \(7016 \times 4960\) pixels.

The computational times of the different algorithms are presented in Table 3. For implementation convenience, the prototype of our algorithm was implemented in Matlab and C++ (Footnote 10). Compared with other state-of-the-art binarization algorithms, e.g. Niblack [11] and Gatos [5], our technique takes more computational time, but it outperforms them in terms of accuracy. Even though one portion of our algorithm is implemented in Matlab (slower than C++ or Python), it is still much faster than the algorithm of Jia et al. [9] (Footnote 11) and our previous work [10].

Fig. 5. (a) The result of Sauvola binarization, (b) results of our previous binarization technique [10], (c) the binarization result of Jia et al. [9].

5 Conclusion

In this paper, we have proposed a technique to remove the background and extract the text from French university diplomas. The proposed approach performs satisfactorily well on diploma images compared to many other state-of-the-art techniques (see Table 2), with a reasonable computational time (see Table 3). All experiments were run on an Intel i7-8850H CPU with 32 GB RAM (in C++ and Matlab R2018a).