
1 Introduction

Nowadays, hand-held image capturing devices are prevalent. As a result, massive amounts of document images are produced in our daily lives, which calls for OCR techniques to facilitate information retrieval from these rich resources. However, most OCR systems are built upon document images of high quality and resolution, which cannot be guaranteed for hand-held captures in uncontrolled environments. Generally, document images acquired by hand-held devices are easily degraded by different types of blur and noise. This degradation is aggravated by the fact that the text resolution is often low. This phenomenon can be observed in Fig. 1(a), where the OCR performance is damaged by a noisy, blurry, low resolution document image.

Document image super-resolution (SR), which aims to restore the high resolution (HR) document image from one or more of its low resolution (LR) counterparts by reducing noise and maintaining the sharpness of strokes, not only enhances the document image’s perceptual quality and readability, but can also be used as an effective tool to improve OCR accuracy [26]. However, consider the image capturing process illustrated in Eq. 1:

$$\begin{aligned} \mathbf {y} = (\mathbf {x} \otimes \mathbf {k}) \downarrow _s + \mathbf {n}, \end{aligned}$$
(1)

where \(\mathbf {y}\) is the observed LR image, \(\mathbf {x} \otimes \mathbf {k}\) denotes the convolution between the unknown HR image \(\mathbf {x}\) (normally the noise-free document image) and the degradation kernel \(\mathbf {k}\), \(\downarrow _s\) is the downsampling operation with scale factor s, and \(\mathbf {n}\) is the additive white Gaussian noise introduced by the capturing device. Given an LR image \(\mathbf {y}\), infinitely many HR images \(\mathbf {x}\) satisfy Eq. 1, so the problem is ill-posed and it is generally infeasible to find a proper and efficient mapping from the LR image to the HR image [2, 40].
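For illustration, the degradation model of Eq. 1 can be simulated as in the sketch below; the Gaussian kernel width, scale factor and noise level are arbitrary assumptions, since the true degradation of a real capture device is unknown.

```python
import numpy as np
from scipy import ndimage

def degrade(x, kernel_sigma=1.5, scale=4, noise_std=5.0, rng=None):
    """Simulate Eq. 1: y = (x ⊗ k) ↓_s + n.

    x: HR grayscale document image as a float array in [0, 255].
    The Gaussian kernel and noise level here are illustrative choices,
    not the (unknown) degradation of an actual device.
    """
    rng = np.random.default_rng() if rng is None else rng
    blurred = ndimage.gaussian_filter(x, sigma=kernel_sigma)  # x ⊗ k
    lr = blurred[::scale, ::scale]                            # ↓_s
    lr = lr + rng.normal(0.0, noise_std, lr.shape)            # + n
    return np.clip(lr, 0, 255)
```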

Fig. 1.

A portion of a degraded LR document image and the corresponding SR image produced by the proposed system. The character error rate (CER) obtained by the Tesseract OCR engine for the original LR image is \(27.04\%\), whilst the CER for the generated SR image is \(5.59\%\).

To tackle this challenging problem, much research has been conducted, which can be roughly categorized into two main groups: interpolation-based and learning-based methods. Interpolation-based SR methods, such as bicubic interpolation or Markov random field (MRF) based smoothing [18], are simple and efficient but cannot provide plausible results, especially at large upscaling factors. Learning-based SR systems learn the mapping between LR images/patches and the corresponding HR images/patches and apply the learnt correlations to infer the HR image. Recently, deep learning approaches have produced more favorable results for computer vision tasks and now dominate SR research.

In this paper, we focus on the SR problem for document images in order to improve their OCR accuracy. To achieve this goal, we propose a generative adversarial network (GAN) based framework for SR, where the generator produces HR document images from the corresponding LR images, and the discriminator, trained with a relativistic loss function, is employed to distinguish LR and HR images. Figure 1(b) shows an SR document image generated by the proposed system, from which a higher OCR accuracy is obtained compared with its LR counterpart. In summary, our main contributions are threefold:

  • We propose to use a multiscale structural similarity (M-SSIM) loss instead of a mean squared error (MSE) loss to train the super-resolution document image generator, which effectively captures the structural properties of text in document images;

  • Inspired by the transformer employed in machine translation (MT) systems [36], spatial attention layers are applied in the proposed generator to boost super-resolution performance;

  • Based on the GAN, an end-to-end document image super-resolution system is designed, which does not depend on annotated or aligned low-resolution/high-resolution image pairs for training, yet improves the OCR accuracy on public datasets.

We organize the rest of this article as follows. In Sect. 2, we introduce the related work on image SR and document image OCR accuracy improvement, followed by the proposed document image SR approach in Sect. 3. The experimental setup and analysis are covered in Sect. 4, and we conclude our work in Sect. 5.

2 Related Work

2.1 OCR Improvements

Generally, three types of approaches are employed to improve the performance of OCR. One trend is to use various preprocessing methods to improve the quality of document images, including many simple manipulations such as noise removal, image enhancement, deskewing, dewarping, etc. [1, 3, 8, 34].

Another type of approach aims to improve the recognition capability of OCR itself. Early OCR engines were mostly segmentation-based, requiring a sophisticated segmentation algorithm to guarantee OCR performance [33]. However, in many applications it is hard or impossible to segment a text line into single characters, especially for images of low quality or with handwritten text. Thus, instead of relying on models for individual characters, segmentation-free OCR engines consider entire text lines as sequential signals and encode them into a single model. For example, Decerbo et al. used hidden Markov models (HMMs) to model handwritten text, applying a sliding window strategy to convert the 2D text line image into a 1D signal [4, 5]. One potential problem of HMM based OCR is its reliance on hand-crafted features, which requires domain knowledge to design and degrades performance. Modern OCR engines therefore tend to utilize recurrent neural networks (RNNs) combined with convolutional neural networks (CNNs) to automatically extract features from text line images [29].

Other research attempts to apply post-processing techniques to correct OCR outputs. In [11], Jean-Caurant et al. proposed to use lexicographical similarity to re-order OCR outputs and build a graph connecting similar named entities to improve OCR performance for names. By employing i-vector features to estimate the OCR accuracy of each text line, Peng et al. rescored the OCR outputs and used a lattice to correct them [27]. In [39], Xu and Smith designed an approach that detects duplicated text in OCR outputs and performs consensus decoding combined with a language model to improve accuracy. Inspired by ideas from machine translation, Mokhtar et al. applied a neural machine translation approach to convert OCR’s initial outputs into final corrected outputs [20].

2.2 Document Image Super-Resolution

As a powerful image processing method that can effectively enhance image quality, image super-resolution (sometimes also called image restoration) has been a mainstream research direction for a long time. Recently, document image super-resolution has also gained increasing attention from the research community. In [22], a selective patch processing scheme was proposed by Nayef et al., where patches with high variance were reconstructed by a learned model while the remaining patches were interpolated with a bicubic approach to balance efficiency and accuracy. To learn the mapping between noisy LR document patches and HR patches, Walha et al. designed a textual image resolution enhancement framework using both online and offline dictionaries [37].

Most state-of-the-art document super-resolution approaches prefer deep learning based frameworks. In the ICDAR2015 Competition on Text Image Super-Resolution [28], Dong et al. adapted an SR convolutional neural network originally designed for natural image SR to document images and won first place [7]. Based on a similar idea, Su et al. employed SRGAN [17] on document images to improve OCR accuracy [35]. SRGAN was also used by Lat and Jawahar, who combined the SR document image generated by the GAN with a bilinearly interpolated image to obtain the final SR document image [16]. To avoid the loss of textual details caused by denoising prior to restoration, Sharma et al. suggested a noise-resilient SR framework to boost OCR performance, where super-resolution and denoising were performed simultaneously based on a stacked sparse de-noising auto-encoder and a coupled deep convolutional auto-encoder [31]. This idea was also applied in Fu et al.’s approach, where multiple detail-preserving networks were stacked to extract detailed features from a set of LR document images during the up-scaling process [9]. To overcome the severe degradation found in historical documents, Nguyen et al. proposed a character attention GAN to restore degraded characters in old documents so that OCR engines can improve their accuracy [24]. In [21], Nakao et al. trained two SR-CNNs on a character dataset and the ImageNet dataset separately and combined the outputs of the two networks to obtain the SR document images. Unlike other approaches, in which only image-based criteria guide the training of the neural network, Sharma et al. introduced an end-to-end DNN framework integrating document image preprocessing and OCR into a single pipeline to boost OCR performance [32]. In this framework, they trained a GAN for de-noising, followed by a deep back-projection network (DBPN) for the SR task and a bidirectional long short-term memory (BLSTM) network for OCR.

3 Proposed Approach

3.1 Overall System Architecture

In this paper, our aim is to improve the OCR accuracy of document images by solving the SR problem. To this end, a generative adversarial network based super-resolution system is proposed, where the input and output images have the same size but different resolutions. Although plenty of degraded LR document images are available in real applications, it is hard to find the corresponding HR images for supervised training. Thus, in this work, a two-stage training scheme is carried out: in the first stage, a small number of aligned training samples are used to pre-train the CNN based generator, and in the second stage, a large number of LR document images are used to train the generator with the help of the discriminator.

In Fig. 2, the overall architecture of the proposed SR system is illustrated. The proposed super-resolution document image generator \(\mathcal {G}_\theta \) is composed of a feature extractor and an image reconstructor, connected by visual attention layers. Within the GAN framework, a discriminator \(\mathcal {D}_\xi \) is also designed, which distinguishes the generated super-resolution document images from high resolution images. To train the proposed super-resolution image generator, the structural similarity loss is applied to guide the training of both the generator and the discriminator, as described in detail in Sect. 3.4.

Fig. 2.

An illustration of the proposed super-resolution document image generation system based on GAN.

3.2 Image Generator \(\mathcal {G}_\theta \)

To super-resolve LR images, many deep learning approaches use a stack of fully convolutional layers (such as ResNet blocks [10]) to reconstruct the images, where the networks’ inputs are LR images enlarged by bicubic interpolation with a \(4{\times }\) upscaling factor and each convolutional layer preserves the input size [13, 17].

Unlike these methods, which do not employ internal up-sampling and down-sampling in their CNN layers, in our work we apply the encoder-decoder idea, widely used for semantic labeling [25], to accomplish the super-resolution task. In this architecture, the encoder extracts image features at different scales and the decoder reconstructs the super-resolved image. In particular, we employ a set of CNN layers along with batch normalization and LeakyReLU layers to construct the feature extractor. To avoid pooling layers for the down-sampling operation, we apply a stride of 2 in each CNN layer of the feature extractor. To build the image reconstructor, the same number of transposed convolutional layers are used, forming a mirrored version of the feature extractor. To overcome the vanishing gradient problem, which easily occurs in very deep neural networks, shortcut connections between corresponding layers of the feature extractor and the image reconstructor are added, which effectively reuses the extracted features in the image reconstruction process [30].
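As a concrete illustration, a minimal PyTorch sketch of such an encoder-decoder backbone is given below; the attention layers introduced next are omitted here, and the depth and channel widths are illustrative assumptions rather than the exact configuration used in our system.

```python
import torch
import torch.nn as nn

class EncoderDecoderSR(nn.Module):
    """Sketch of the described encoder-decoder generator (attention layers
    omitted). Depth and channel widths are illustrative assumptions."""

    def __init__(self, channels=(64, 128, 256)):
        super().__init__()
        self.encoders, self.decoders = nn.ModuleList(), nn.ModuleList()
        c_in = 1                          # grayscale document image
        for c in channels:                # stride-2 convs instead of pooling
            self.encoders.append(nn.Sequential(
                nn.Conv2d(c_in, c, 3, stride=2, padding=1),
                nn.BatchNorm2d(c), nn.LeakyReLU(0.2, inplace=True)))
            c_in = c
        for c in reversed(channels[:-1]):  # mirrored transposed convolutions
            self.decoders.append(nn.Sequential(
                nn.ConvTranspose2d(c_in, c, 4, stride=2, padding=1),
                nn.BatchNorm2d(c), nn.LeakyReLU(0.2, inplace=True)))
            c_in = c
        self.out = nn.ConvTranspose2d(c_in, 1, 4, stride=2, padding=1)

    def forward(self, x):
        feats, h = [], x
        for enc in self.encoders:
            h = enc(h)
            feats.append(h)
        skips = feats[-2::-1]             # shortcuts to matching decoder layers
        for dec, skip in zip(self.decoders, skips):
            h = dec(h) + skip             # reuse extracted features (Sect. 3.2)
        return torch.sigmoid(self.out(h))  # same spatial size as the input
```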

Additionally, to encourage the proposed SR document image generator to focus on the features that contribute most to the details of the SR image, visual self-attention layers are introduced to connect the feature extractor and the image reconstructor. Originating from machine translation (MT) [36], the self-attention technique has been successfully applied to many image/vision tasks [41]. Formally, given an image or feature map \(\mathbf {x}\) of size \(n \times n\), the self-attention model can be expressed as:

$$\begin{aligned} A(\mathbf {q,k,v}) = \text {softmax} \left( \mathbf {q(x)} \mathbf {k(x)^T} \right) \mathbf {v(x)}, \end{aligned}$$
(2)

where \(\mathbf {q}(\cdot ), \mathbf {k}(\cdot )\) and \(\mathbf {v}(\cdot )\) are vectorized features of \(\mathbf {x}\) in three feature spaces, each of size \(n^2 \times 1\). Intuitively, \(\mathbf {q(x)} \mathbf {k(x)}^T\) calculates the correlations between pixels within the image to find which areas are important, and the result is used to re-weight \(\mathbf {v(x)}\). In our implementation, we use \(3 \times 3\) convolutions with reduced feature channels to calculate \(\mathbf {q(x)}\) and \(\mathbf {k(x)}\), but employ a \(1 \times 1\) convolution to compute \(\mathbf {v(x)}\), balancing accuracy and efficiency.

Based on the obtained self-attention, the output of self-attention layer can be calculated by:

$$\begin{aligned} \mathbf {o} = \mathbf {h} \left( M \big (A(\mathbf {q,k,v}) \big ) \right) + \mathbf {x}, \end{aligned}$$
(3)

where \(M(\cdot )\) is the operation that converts \(A(\mathbf {q,k,v})\) back to an \(n \times n\) matrix and \(\mathbf {h}(\cdot )\) is a \(1 \times 1\) convolution.
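For concreteness, a PyTorch sketch of this attention layer following Eqs. 2 and 3 is given below. The channel reduction factor of 8 is an assumption (only the kernel sizes are specified above), and \(\mathbf {q}\)/\(\mathbf {k}\) keep a reduced channel dimension rather than being strictly \(n^2 \times 1\) vectors.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention2d(nn.Module):
    """Sketch of the attention layer of Eqs. 2-3: q/k from 3x3 convs with
    reduced channels, v from a 1x1 conv, plus a residual connection."""

    def __init__(self, c, reduction=8):   # reduction factor is an assumption
        super().__init__()
        self.q = nn.Conv2d(c, c // reduction, 3, padding=1)
        self.k = nn.Conv2d(c, c // reduction, 3, padding=1)
        self.v = nn.Conv2d(c, c, 1)
        self.h = nn.Conv2d(c, c, 1)       # h(.) in Eq. 3

    def forward(self, x):
        b, c, n1, n2 = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)   # b x n^2 x c'
        k = self.k(x).flatten(2)                   # b x c' x n^2
        v = self.v(x).flatten(2).transpose(1, 2)   # b x n^2 x c
        attn = F.softmax(q @ k, dim=-1)            # softmax(q k^T), Eq. 2
        o = (attn @ v).transpose(1, 2).reshape(b, c, n1, n2)  # M(.): reshape
        return self.h(o) + x                       # Eq. 3: residual output
```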

The structure of the proposed SR document image generator is demonstrated in Fig. 3.

Fig. 3.

The structure of the proposed SR document image generator \(\mathcal {G}_\theta \).

3.3 Quality Assessor \(\mathcal {D}_\xi \)

Under the GAN framework, the discriminator \(\mathcal {D}_\xi \) is designed to differentiate LR document images from HR document images, which guides the training of the image generator \(\mathcal {G}_\theta \) based on their visual quality. In this work, the quality assessor \(\mathcal {D}_\xi \) is built from 4 ResNet blocks, where each ResNet block is followed by a LeakyReLU layer and a max-pooling layer for down-sampling. The last three layers of the quality assessor \(\mathcal {D}_\xi \) are fully connected layers of size 1024, 512 and 1, respectively.
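A sketch of such a quality assessor under stated assumptions follows: the convolutional channel widths and the \(256 \times 256\) input size are illustrative, since only the block structure and the fully connected layer sizes are specified above.

```python
import torch.nn as nn

class ResBlock(nn.Module):
    """Minimal residual block; a 1x1 conv matches channels for the skip."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, padding=1), nn.BatchNorm2d(c_out),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(c_out, c_out, 3, padding=1), nn.BatchNorm2d(c_out))
        self.skip = nn.Conv2d(c_in, c_out, 1) if c_in != c_out else nn.Identity()

    def forward(self, x):
        return self.body(x) + self.skip(x)

class QualityAssessor(nn.Module):
    """Sketch of D_xi: 4 ResNet blocks, each followed by LeakyReLU and
    max-pooling, then FC layers of size 1024, 512 and 1."""

    def __init__(self, widths=(64, 128, 256, 512)):  # widths are assumptions
        super().__init__()
        blocks, c_in = [], 1
        for c in widths:
            blocks += [ResBlock(c_in, c),
                       nn.LeakyReLU(0.2, inplace=True),
                       nn.MaxPool2d(2)]
            c_in = c
        self.features = nn.Sequential(*blocks)
        self.head = nn.Sequential(      # assumes 256x256 inputs -> 16x16 maps
            nn.Flatten(),
            nn.Linear(widths[-1] * 16 * 16, 1024), nn.LeakyReLU(0.2, True),
            nn.Linear(1024, 512), nn.LeakyReLU(0.2, True),
            nn.Linear(512, 1))          # raw score; sigmoid applied in the loss

    def forward(self, x):               # x: b x 1 x 256 x 256
        return self.head(self.features(x))
```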

To train \(\mathcal {D}_\xi \), the document images produced by the proposed image generator \(\mathcal {G}_\theta \) and ideal HR images are fed into the discriminator, where the dissimilarity (loss) between these two types of images is calculated and subsequently used to train the image generator \(\mathcal {G}_\theta \).

3.4 Loss Functions for Training

Given an LR document image \(\mathbf {y}\) and its HR counterpart \(\mathbf {x}\), the goal of training the image generator \(\mathcal {G}_\theta \) is to find the optimal parameters \(\theta \) such that:

$$\begin{aligned} \bar{\theta }= \arg \min _\theta KL \left( \mathcal {G}_\theta (\mathbf {y}) \Vert \mathbf {x} \right) , \end{aligned}$$
(4)

where \(KL(\cdot \Vert \cdot )\) is the Kullback-Leibler divergence, which measures the dissimilarity between the super-resolved image \(\mathcal {G}_\theta (\mathbf {y})\) and the HR image \(\mathbf {x}\). Ideally, Eq. 4 is minimized when the SR document image \(\mathcal {G}_\theta (\mathbf {y})\) has the same distribution as the HR image \(\mathbf {x}\) w.r.t. resolution/quality. In the GAN setting, the Kullback-Leibler divergence between these two types of images is estimated by maximizing the objective function of the quality assessor \(\mathcal {D}_\xi \).

Although a GAN can learn the SR image generator \(\mathcal {G}_\theta \) from a large amount of unlabeled data, training can easily saturate because of the GAN’s large capacity and huge search space. Thus, in this work, we carry out a two-phase training scheme to train \(\mathcal {G}_\theta \). In the first training phase, we apply supervised training to obtain initial parameters \(\theta \) for \(\mathcal {G}_\theta \) that are close to the optimum, using a small amount of HR images along with their LR counterparts collected for this purpose. Unlike conventional approaches in which the mean squared error (MSE) is used as the loss function, we introduce the multiscale structural similarity (M-SSIM) loss to train the proposed model [38]. Given two image patches \(\mathbf {x}\) and \(\mathbf {y}\) of the same size, their M-SSIM is calculated by:

$$\begin{aligned} \text {M-SSIM}(\mathbf {x}, \mathbf {y}) = l(\mathbf {x}, \mathbf {y})^\alpha \cdot \prod _{j=1}^M c_j(\mathbf {x}, \mathbf {y})^{\beta _j} s_j(\mathbf {x}, \mathbf {y})^{\gamma _j}, \end{aligned}$$
(5)

where \(l(\cdot , \cdot )\) is the luminance measure, \(c_j(\cdot , \cdot )\) and \(s_j(\cdot , \cdot )\) are the contrast and structure measures at scale j, and the hyper-parameters \(\alpha , \beta _j\) and \(\gamma _j\) control the influence of these three measures. The detailed definitions of these measures can be found in [38].

Based on M-SSIM, we define the M-SSIM loss between two image patches \(\mathbf {x}\) and \(\mathbf {y}\) as:

$$\begin{aligned} \mathcal {L}_{\text {M-SSIM}} = \frac{1-\text {M-SSIM}(\mathbf {x}, \mathbf {y})}{2} \end{aligned}$$
(6)

As can be seen from Eq. 5, the advantage of the M-SSIM loss over the MSE loss is that MSE only measures absolute errors by comparing differences between individual pixels, whereas the M-SSIM loss focuses more on perceived quality by using local statistics of the two images.
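A compact sketch of this loss is given below. It makes two simplifying assumptions: a uniform window replaces the Gaussian window of [38], and the five-scale weights from [38] stand in for \(\alpha , \beta _j, \gamma _j\) (with the contrast and structure exponents merged, as is common).

```python
import torch
import torch.nn.functional as F

def _ssim_terms(x, y, win=11, C1=0.01 ** 2, C2=0.03 ** 2):
    # Local statistics via a uniform window (Gaussian in [38]);
    # images are assumed to be scaled to [0, 1].
    pad = win // 2
    mu_x, mu_y = (F.avg_pool2d(t, win, 1, pad) for t in (x, y))
    var_x = F.avg_pool2d(x * x, win, 1, pad) - mu_x ** 2
    var_y = F.avg_pool2d(y * y, win, 1, pad) - mu_y ** 2
    cov = F.avg_pool2d(x * y, win, 1, pad) - mu_x * mu_y
    l = (2 * mu_x * mu_y + C1) / (mu_x ** 2 + mu_y ** 2 + C1)  # luminance
    cs = (2 * cov + C2) / (var_x + var_y + C2)   # contrast * structure
    return l.mean(), cs.mean().clamp(min=1e-6)   # clamp keeps cs ** w real

def ms_ssim_loss(x, y, weights=(0.0448, 0.2856, 0.3001, 0.2363, 0.1333)):
    """(1 - M-SSIM(x, y)) / 2 as in Eq. 6; as in Eq. 5, luminance enters
    only at the coarsest scale while cs enters at every scale."""
    cs_terms = []
    for _ in range(len(weights) - 1):
        _, cs = _ssim_terms(x, y)
        cs_terms.append(cs)
        x, y = F.avg_pool2d(x, 2), F.avg_pool2d(y, 2)  # next coarser scale
    l, cs = _ssim_terms(x, y)
    ms_ssim = (l * cs) ** weights[-1]
    for w, c in zip(weights[:-1], cs_terms):
        ms_ssim = ms_ssim * c ** w
    return (1.0 - ms_ssim) / 2.0                       # Eq. 6
```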

In the second training phase, a large amount of unlabeled training samples is used with a combination of LS-GAN [19] and Ra-GAN [12] to train the generator \(\mathcal {G}_\theta \), where the losses for the generator \(\mathcal {G}_\theta \) and the discriminator \(\mathcal {D}_\xi \) are defined by:

$$\begin{aligned} \min \mathcal {L}_{\mathcal {G}_\theta }&= \frac{1}{2} \mathbb {E}_{\mathbf {y} \sim \mathbb {P}(\mathbf {y})} \left[ \Big ( \delta \big (\mathcal {D}_\xi (\mathcal {G}_\theta (\mathbf {y})) - \mathbb {E}_{\mathbf {x} \sim \mathbb {P}(\mathbf {x})} [\mathcal {D}_\xi (\mathbf {x})] \big ) - a \Big )^2 \right] \end{aligned}$$
(7)
$$\begin{aligned} \min \mathcal {L}_{\mathcal {D}_\xi }&= \frac{1}{2} \mathbb {E}_{\mathbf {x} \sim \mathbb {P}(\mathbf {x})} \left[ \Big (\delta \big (\mathcal {D}_\xi (\mathbf {x}) - \mathbb {E}_{\mathbf {y} \sim \mathbb {P}(\mathbf {y})}[\mathcal {D}_\xi (\mathcal {G}_\theta (\mathbf {y}))] \big ) - b \Big )^2 \right] \nonumber \\&+ \frac{1}{2} \mathbb {E}_{\mathbf {y} \sim \mathbb {P}(\mathbf {y})} \left[ \Big ( \delta \big (\mathcal {D}_\xi (\mathcal {G}_\theta (\mathbf {y})) - \mathbb {E}_{\mathbf {x} \sim \mathbb {P}(\mathbf {x})}[\mathcal {D}_\xi (\mathbf {x})] \big )- c \Big )^2 \right] \end{aligned}$$
(8)

where \(\mathbb {E}[\cdot ]\) denotes expectation, \(\delta (\cdot )\) is the sigmoid function, \(\mathbf {x}\) is an HR training image and \(\mathbf {y}\) is an LR training sample; we take \(a=1, b=1\) and \(c=0\) in our work.

As we can see from Eq. 7 and Eq. 8, unlike conventional GAN losses, Ra-GAN uses the relative average difference between the discriminator outputs for HR and LR training samples within a training batch, which provides a more stable training process. The use of LS-GAN further helps the proposed system avoid saturation during the GAN training phase.
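Given the discriminator scores for a batch of HR and generated images, these two losses reduce to a few lines. The sketch below assumes the usual GAN practice of computing the generator and discriminator updates in separate forward passes, with the generator output detached for the discriminator update.

```python
import torch

def ra_lsgan_losses(d_real, d_fake, a=1.0, b=1.0, c=0.0):
    """Relativistic-average LS-GAN losses of Eqs. 7-8.

    d_real: discriminator scores D(x) for a batch of HR images.
    d_fake: discriminator scores D(G(y)) for a batch of generated images
            (detach G(y) before scoring when optimizing the discriminator).
    """
    sig = torch.sigmoid                       # delta(.) in Eqs. 7-8
    rel_fake = sig(d_fake - d_real.mean())    # D(G(y)) relative to avg D(x)
    rel_real = sig(d_real - d_fake.mean())    # D(x) relative to avg D(G(y))
    loss_g = 0.5 * ((rel_fake - a) ** 2).mean()                      # Eq. 7
    loss_d = (0.5 * ((rel_real - b) ** 2).mean()
              + 0.5 * ((rel_fake - c) ** 2).mean())                  # Eq. 8
    return loss_g, loss_d
```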

4 Experiments and Results

4.1 Datasets and Evaluation Metrics

To train the proposed super-resolution document image generator, the SmartDoc-QA dataset [23] and the B-MOD dataset [14] were employed.

The SmartDoc-QA dataset contains a total of 4260 camera-captured document images of size \(3096 \times 4128\). All images were obtained from 30 noise-free documents using two types of mobile phone cameras, with different types of distortions and degradations such as uneven lighting and out-of-focus blur. In our experiments, we selected 90% of the images in this set for supervised training in the first training phase, and the remaining 10% were used for tuning.

Prior to the supervised training, the captured images of the SmartDoc-QA dataset were rectified using a RANSAC based perspective transformation, where the matching points between each LR image and the corresponding HR image were extracted with SURF features.
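For illustration, a sketch of this rectification step with OpenCV follows. The SURF Hessian threshold and the RANSAC reprojection tolerance are assumptions, and SURF itself requires the non-free opencv-contrib build (ORB would be a patent-free drop-in substitute).

```python
import cv2
import numpy as np

def rectify(lr_img, hr_img):
    """Match SURF keypoints between the captured LR image and its HR source,
    fit a homography with RANSAC, and warp the capture onto the HR frame."""
    surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)  # needs contrib
    kp1, des1 = surf.detectAndCompute(lr_img, None)
    kp2, des2 = surf.detectAndCompute(hr_img, None)
    matches = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True).match(des1, des2)
    src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, ransacReprojThreshold=5.0)
    h, w = hr_img.shape[:2]
    return cv2.warpPerspective(lr_img, H, (w, h))
```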

The document images in the B-MOD dataset were captured in the same manner as in the SmartDoc-QA dataset but with a larger number of samples: 2,113 unique pages from random scientific papers were collected and photographed, resulting in a total of 19,725 images obtained with 23 different mobile devices. Although rectified document images are available in this dataset, the correspondences between these rectified images and their HR counterparts are not provided. Thus, in our experiments, we used the HR and LR images in this dataset for the unsupervised training of the GAN.

Because there is no public benchmark dataset available for the document image super-resolution task, we used the SoC dataset [15] to evaluate the performance of the proposed system. In the SoC dataset, a total of 175 document images were captured from 25 “ideally clean” documents with different focal lengths. For each image, the ground-truth text is also provided.

In Fig. 4, sample images from these three datasets are shown.

Fig. 4.

Sample images from the training and evaluation datasets. (a) Rectified sample image from the SmartDoc-QA dataset. (b) Rectified training image from the B-MOD dataset, where the markers around the image are used for rectification. (c) Raw image from the SoC dataset for evaluation.

To measure the performance of image super-resolution, the peak signal-to-noise ratio (PSNR) is a widely used metric, where a higher value means better performance. However, PSNR only considers the differences between pixel values at the same positions and ignores human visual perception, which makes it a poor indicator of SR image quality. Furthermore, since we are primarily interested in improving OCR performance through SR, and expect the resolved images not only to have higher resolution but also to contain less noise than the original LR images, we use the character error rate (CER) of the OCR outputs to evaluate the proposed document image super-resolution system.
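Concretely, CER is the character-level edit (Levenshtein) distance between the OCR output and the ground-truth text, normalized by the length of the ground truth; a minimal self-contained implementation is sketched below.

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance between the OCR output
    (hypothesis, e.g. from pytesseract.image_to_string) and the ground-truth
    text (reference), normalized by the reference length."""
    m, n = len(reference), len(hypothesis)
    prev = list(range(n + 1))            # DP row for the empty reference
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            sub = prev[j - 1] + (reference[i - 1] != hypothesis[j - 1])
            cur[j] = min(prev[j] + 1,    # deletion
                         cur[j - 1] + 1, # insertion
                         sub)            # substitution (or match)
        prev = cur
    return prev[n] / max(m, 1)
```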

4.2 System Training

As described in Sect. 3.2 and Sect. 3.3, we train the proposed super-resolution generator in two training phases.

In the first training phase, supervised training was carried out using the training data from the SmartDoc-QA dataset. To train the generator \(\mathcal {G}_\theta \), we randomly cropped small patches from the rectified camera-captured document images and fed them into the generator. Patches at the same locations in the corresponding noise-free images were also cropped and used as ground truths to guide the training of the generator \(\mathcal {G}_\theta \). In our experiments, the patch size was \(256 \times 256\).

In the second training phase, the rectified camera-captured images from the B-MOD dataset were used as the LR training samples, and the 2,113 clean images from this dataset, along with 3,554 additional in-house noise-free document images, were utilized as the HR training samples. These unlabeled training images were randomly cropped and fed into the proposed system to train the GAN, and the relativistic average losses were used to guide the training of the generator \(\mathcal {G}_\theta \) and the discriminator \(\mathcal {D}_\xi \).

In these experiments, the Adam optimizer was used in both the supervised and unsupervised training phases with an initial learning rate of \(10^{-3}\). The training was stopped when the improvement of the training loss fell below a threshold.

4.3 Evaluation of Generator \(\mathcal {G}_\theta \)

To evaluate the proposed document image super-resolution system with different losses and neural network architectures, we trained multiple generators \(\mathcal {G}_\theta \) with different settings. The baseline system was a super-resolution document image generator trained with supervised training only, using the MSE loss and with the attention layers removed from the generator. A second generator with the same architecture but the M-SSIM loss was also trained with the supervised strategy. For the third SR document image generator, we added the attention layers to the network as shown in Fig. 3. Based on the third generator, we retrained it in the GAN framework to obtain the final SR document image generator.

To assess the performance of these systems, the 175 images from the SoC dataset were used. Considering these degraded images as enlarged versions of their LR counterparts, we used them directly as inputs to the SR image generator \(\mathcal {G}_\theta \) to produce the SR document images. The Tesseract OCR engine was then used to transcribe the obtained SR images, and the CERs were calculated. In Table 1, the CERs for the original images and for the images generated by the SR systems we implemented are listed. It can be observed that the CER for the original degraded document images is as high as \(22.03\%\). Using the CNN based SR image generator, the CER of the produced document images decreases to \(17.00\%\). With the introduced M-SSIM loss and attention layers, the CER of the restored images decreases further to \(14.95\%\). From the last row of the table, it can be seen that the images generated by the SR system trained with the GAN achieve the lowest CER of \(14.40\%\), which shows the effectiveness of the proposed M-SSIM loss, attention layers and GAN training scheme for the SR system.

Table 1. OCR performance for the different SR document image generators trained in this work, where the original image is used as the system input.

4.4 Comparison with the State-of-the-Art

To compare the OCR performance of the proposed SR document image generation system with other state-of-the-art SR approaches, a comparison experiment was conducted in which the images in the SoC dataset were first downsampled by a factor of 4. Then, these down-sized images were super-resolved using bicubic interpolation, SRCNN [6], SRGAN [17], and the proposed approach, respectively. In this experiment, we retrained SRGAN using the same training data from the SmartDoc-QA and B-MOD datasets. Note that because the proposed SR document image generator \(\mathcal {G}_\theta \) takes inputs of the same size as its outputs, we upsampled the down-sized images with nearest-neighbor interpolation before sending them to the proposed generator. The obtained SR document images were transcribed with the Tesseract OCR engine, and the CERs are reported in Table 2.

Table 2. Comparison between the proposed SR approach and the state-of-the-art methods w.r.t. OCR performance, where the original image is downsampled by a factor of 4 before being used as the system input.

As can be seen from this table, for document images down-sampled by a factor of 4, the resolved images obtained by the proposed method achieve almost the same OCR accuracy as the original images. In contrast, the images produced by the other methods, even though visually close to the original images, still have worse OCR performance.

Fig. 5.

Output SR images by different approaches.

In Fig. 5, the original document image and the SR images restored by the different approaches are illustrated. We can observe from this figure that, unlike the SR images generated by the other methods, which still contain plenty of background noise and uneven/bad lighting effects, the image produced by the proposed SR method removes the lighting effects and provides a clean background. The main reason for this advantage is that the proposed SR system uses a GAN to train the generator \(\mathcal {G}_\theta \), which does not rely on paired HR/LR training images. We can therefore collect a large amount of ideally noise-free documents as targets to guide the training of the discriminator and the generator, enabling the obtained generator to produce high quality images with less noise.

5 Conclusion

In this work, we propose an end-to-end neural network for SR document image generation, where the image generator is composed of an encoder and a decoder connected by the proposed attention layers. To obtain the SR document generator, a two-phase training strategy is implemented: in the first phase, supervised training is carried out with the designed M-SSIM loss, which effectively improves SR performance w.r.t. OCR accuracy. In the second training phase, a combination of LS-GAN and Ra-GAN is applied, which largely relaxes the restrictions on the training samples, so that a large number of unlabeled training samples, including ideally noise-free images, can be collected to train the proposed SR system. The experimental results demonstrate the effectiveness of the proposed SR method based on OCR performance. In the future, we will explore other loss functions, including OCR based loss functions, to guide the training of the SR image generator.