
1 Introduction

Nowadays, hand-held image capturing devices are prevalent. As a result, massive amounts of document images are produced in our daily lives, which calls for OCR techniques to facilitate information retrieval from these rich resources. However, most OCR systems are built upon document images of high quality and resolution, which cannot be guaranteed for hand-held captures in uncontrolled environments. Generally, document images acquired by hand-held devices are easily degraded by different types of blur and noise. This degradation is aggravated by the fact that the text resolution is often low. This phenomenon can be observed in Fig. 1(a), where the OCR performance is damaged by a noisy, blurry, low resolution document image.

Document image super-resolution (SR), which aims to restore the high resolution (HR) document image from one or more of its low resolution (LR) counterparts by reducing noise and maintaining the sharpness of strokes, not only enhances the document image’s perceptual quality and readability, but can also be used as an effective tool to improve OCR accuracy [26]. However, consider the image capturing process illustrated in Eq. 1:

$$\begin{aligned} \mathbf {y} = (\mathbf {x} \otimes \mathbf {k}) \downarrow _s + \mathbf {n}, \end{aligned}$$
(1)

where \(\mathbf {y}\) is the observed LR image, \(\mathbf {x} \otimes \mathbf {k}\) denotes the convolution between the unknown HR image \(\mathbf {x}\) (normally the noise-free document image) and the degradation kernel \(\mathbf {k}\), \(\downarrow _s\) is the downsampling operation with scale factor s, and \(\mathbf {n}\) is the additive white Gaussian noise introduced by the capturing device. Given an LR image \(\mathbf {y}\), infinitely many HR images \(\mathbf {x}\) satisfy Eq. 1, so the problem is ill-posed and it is generally infeasible to find a proper and efficient mapping from the LR image to the HR image [2, 40].
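For illustration, the degradation model of Eq. 1 can be simulated as in the sketch below; the Gaussian kernel width, scale factor and noise level are arbitrary assumptions, since the true degradation of a real capture device is unknown.

```python
import numpy as np
from scipy import ndimage

def degrade(x, kernel_sigma=1.5, scale=4, noise_std=5.0, rng=None):
    """Simulate Eq. 1: y = (x ⊗ k) ↓_s + n.

    x: HR grayscale document image as a float array in [0, 255].
    The Gaussian kernel and noise level here are illustrative choices,
    not the (unknown) degradation of an actual device.
    """
    rng = np.random.default_rng() if rng is None else rng
    blurred = ndimage.gaussian_filter(x, sigma=kernel_sigma)  # x ⊗ k
    lr = blurred[::scale, ::scale]                            # ↓_s
    lr = lr + rng.normal(0.0, noise_std, lr.shape)            # + n
    return np.clip(lr, 0, 255)
```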

Fig. 1.

A portion of a degraded LR document image and the corresponding SR image produced by the proposed system. The character error rate (CER) obtained by the Tesseract OCR engine for the original LR image is \(27.04\%\), whilst the CER for the generated SR image is \(5.59\%\).

To tackle this challenging problem, much research has been conducted, which can be roughly categorized into two main groups: interpolation-based and learning-based methods. Interpolation-based SR methods, such as bicubic interpolation or Markov random field (MRF) based smoothing [18], are simple and efficient but cannot provide plausible results, especially at large upscaling factors. Learning-based SR systems learn the mapping between LR images/patches and the corresponding HR images/patches and apply the learnt correlations to infer the HR image. Recently, deep learning approaches have produced more favorable results for computer vision tasks and now dominate SR research.

In this paper, we focus on the SR problem for document images in order to improve their OCR accuracy. To achieve this goal, we propose a generative adversarial network (GAN) based framework for SR, where the generator produces HR document images from the corresponding LR images, and the discriminator, trained with a relativistic loss function, is employed to distinguish LR and HR images. Figure 1(b) shows an SR document image generated by the proposed system, from which a higher OCR accuracy is obtained compared with its LR counterpart. In summary, our main contributions are threefold:

  • We propose to use a multiscale structural similarity (M-SSIM) loss instead of a mean squared error (MSE) loss to train the super-resolution document image generator, which effectively captures the structural properties of text in document images;

  • Inspired by the transformer employed in machine translation (MT) systems [36], spatial attention layers are applied in the proposed generator to boost super-resolution performance;

  • Based on the GAN, an end-to-end document image super-resolution system is designed, which does not depend on annotated or aligned low-resolution/high-resolution image pairs for training, yet improves the OCR accuracy on public datasets.

We organize the rest of this article as follows. In Sect. 2, we introduce the related work on image SR and document image OCR accuracy improvement, followed by the proposed document image SR approach in Sect. 3. The experimental setup and analysis are covered in Sect. 4, and we conclude our work in Sect. 5.

2 Related Work

2.1 OCR Improvements

Generally, three types of approaches are employed to improve the performance of OCR. One trend is to use various preprocessing methods to improve the quality of document images, including many simple manipulations such as noise removal, image enhancement, deskewing, dewarping, etc. [1, 3, 8, 34].

Another type of approach aims to improve the recognition capability of OCR itself. Early OCR engines were mostly segmentation-based, requiring a sophisticated segmentation algorithm to guarantee OCR performance [33]. However, in many applications it is hard or impossible to segment a text line into single characters, especially for images of low quality or with handwritten text. Thus, instead of relying on models for individual characters, segmentation-free OCR engines consider entire text lines as sequential signals and encode them into a single model. For example, Decerbo et al. used hidden Markov models (HMMs) to model handwritten text, applying a sliding window strategy to convert the 2D text line image into a 1D signal [4, 5]. One potential problem of HMM based OCR is its reliance on hand-crafted features, which requires domain knowledge to design and degrades performance. Modern OCR engines therefore tend to utilize recurrent neural networks (RNNs) combined with convolutional neural networks (CNNs) to automatically extract features from text line images [29].

Other research attempts to apply post-processing techniques to correct OCR outputs. In [11], Jean-Caurant et al. proposed to use lexicographical similarity to re-order OCR outputs and build a graph connecting similar named entities to improve OCR performance for names. By employing i-vector features to estimate the OCR accuracy of each text line, Peng et al. rescored the OCR outputs and used a lattice to correct them [27]. In [39], Xu and Smith designed an approach that detects duplicated text in OCR outputs and performs consensus decoding combined with a language model to improve accuracy. Inspired by ideas from machine translation, Mokhtar et al. applied a neural machine translation approach to convert OCR’s initial outputs into final corrected outputs [20].

2.2 Document Image Super-Resolution

As a powerful image processing method that can effectively enhance image quality, image super-resolution (sometimes also called image restoration) has been a mainstream research direction for a long time. Recently, document image super-resolution has also gained increasing attention from the research community. In [22], a selective patch processing scheme was proposed by Nayef et al., where patches with high variance were reconstructed by a learned model while the remaining patches were interpolated with a bicubic approach to balance efficiency and accuracy. To learn the mapping between noisy LR document patches and HR patches, Walha et al. designed a textual image resolution enhancement framework using both online and offline dictionaries [37].

Most state-of-the-art document super-resolution approaches prefer deep learning based frameworks. In the ICDAR2015 Competition on Text Image Super-Resolution [28], Dong et al. adapted an SR convolutional neural network originally designed for natural image SR to document images and won first place [7]. Based on a similar idea, Su et al. employed SRGAN [17] on document images to improve OCR accuracy [35]. SRGAN was also used by Lat and Jawahar, who combined the SR document image generated by the GAN with a bilinearly interpolated image to obtain the final SR document image [16]. To avoid the loss of textual details caused by denoising prior to restoration, Sharma et al. suggested a noise-resilient SR framework to boost OCR performance, where super-resolution and denoising were performed simultaneously based on a stacked sparse de-noising auto-encoder and a coupled deep convolutional auto-encoder [31]. This idea was also applied in Fu et al.’s approach, where multiple detail-preserving networks were stacked to extract detailed features from a set of LR document images during the up-scaling process [9]. To overcome the severe degradation found in historical documents, Nguyen et al. proposed a character attention GAN to restore degraded characters in old documents so that OCR engines can improve their accuracy [24]. In [21], Nakao et al. trained two SR-CNNs on a character dataset and the ImageNet dataset separately and combined the outputs of the two networks to obtain the SR document images. Unlike other approaches, in which only image-based criteria guide the training of the neural network, Sharma et al. introduced an end-to-end DNN framework integrating document image preprocessing and OCR into a single pipeline to boost OCR performance [32]. In this framework, they trained a GAN for de-noising, followed by a deep back-projection network (DBPN) for the SR task and a bidirectional long short-term memory (BLSTM) network for OCR.

3 Proposed Approach

3.1 Overall System Architecture

In this paper, our aim is to improve the OCR accuracy of document images by solving the SR problem. To this end, a generative adversarial network based super-resolution system is proposed, where the input and output images have the same size but different resolutions. Although plenty of degraded LR document images are available in real applications, it is hard to find the corresponding HR images for supervised training. Thus, in this work, a two-stage training scheme is carried out: in the first stage, a small number of aligned training samples are used to pre-train the CNN based generator, and in the second stage, a large number of LR document images are used to train the generator with the help of the discriminator.

In Fig. 2, the overall architecture of the proposed SR system is illustrated. The proposed super-resolution document image generator \(\mathcal {G}_\theta \) is composed of a feature extractor and an image reconstructor, connected by visual attention layers. Within the GAN framework, a discriminator \(\mathcal {D}_\xi \) is also designed, which distinguishes the generated super-resolution document images from high resolution images. To train the proposed super-resolution image generator, the structural similarity loss is applied to guide the training of both the generator and the discriminator, as described in detail in Sect. 3.4.

Fig. 2.

An illustration of the proposed super-resolution document image generation system based on GAN.

3.2 Image Generator \(\mathcal {G}_\theta \)

To super-resolve LR images, many deep learning approaches use a stack of fully convolutional layers (such as ResNet blocks [10]) to reconstruct the images, where the networks’ inputs are LR images enlarged by bicubic interpolation with a \(4{\times }\) upscaling factor and each convolutional layer preserves the input size [13, 17].

Unlike these methods, which do not employ internal up-sampling and down-sampling in their CNN layers, in our work we apply the encoder-decoder idea, widely used for semantic labeling [25], to accomplish the super-resolution task. In this architecture, the encoder extracts image features at different scales and the decoder reconstructs the super-resolved image. In particular, we employ a set of CNN layers along with batch normalization and LeakyReLU layers to construct the feature extractor. To avoid pooling layers for the down-sampling operation, we apply a stride of 2 in each CNN layer of the feature extractor. To build the image reconstructor, the same number of transposed convolutional layers are used, forming a mirrored version of the feature extractor. To overcome the vanishing gradient problem, which easily occurs in very deep neural networks, shortcut connections between corresponding layers of the feature extractor and the image reconstructor are added, which effectively reuses the extracted features in the image reconstruction process [30].
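As a concrete illustration, a minimal PyTorch sketch of such an encoder-decoder backbone is given below; the attention layers introduced next are omitted here, and the depth and channel widths are illustrative assumptions rather than the exact configuration used in our system.

```python
import torch
import torch.nn as nn

class EncoderDecoderSR(nn.Module):
    """Sketch of the described encoder-decoder generator (attention layers
    omitted). Depth and channel widths are illustrative assumptions."""

    def __init__(self, channels=(64, 128, 256)):
        super().__init__()
        self.encoders, self.decoders = nn.ModuleList(), nn.ModuleList()
        c_in = 1                          # grayscale document image
        for c in channels:                # stride-2 convs instead of pooling
            self.encoders.append(nn.Sequential(
                nn.Conv2d(c_in, c, 3, stride=2, padding=1),
                nn.BatchNorm2d(c), nn.LeakyReLU(0.2, inplace=True)))
            c_in = c
        for c in reversed(channels[:-1]):  # mirrored transposed convolutions
            self.decoders.append(nn.Sequential(
                nn.ConvTranspose2d(c_in, c, 4, stride=2, padding=1),
                nn.BatchNorm2d(c), nn.LeakyReLU(0.2, inplace=True)))
            c_in = c
        self.out = nn.ConvTranspose2d(c_in, 1, 4, stride=2, padding=1)

    def forward(self, x):
        feats, h = [], x
        for enc in self.encoders:
            h = enc(h)
            feats.append(h)
        skips = feats[-2::-1]             # shortcuts to matching decoder layers
        for dec, skip in zip(self.decoders, skips):
            h = dec(h) + skip             # reuse extracted features (Sect. 3.2)
        return torch.sigmoid(self.out(h))  # same spatial size as the input
```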

Additionally, to encourage the proposed SR document image generator to focus on the features that contribute most to the details of the SR image, visual self-attention layers are introduced to connect the feature extractor and the image reconstructor. Originating from machine translation (MT) [36], the self-attention technique has been successfully applied to many image/vision tasks [41]. Formally, given an image or feature map \(\mathbf {x}\) of size \(n \times n\), the self-attention model can be expressed as:

$$\begin{aligned} A(\mathbf {q,k,v}) = \text {softmax} \left( \mathbf {q(x)} \mathbf {k(x)^T} \right) \mathbf {v(x)}, \end{aligned}$$
(2)

where \(\mathbf {q}(\cdot ), \mathbf {k}(\cdot )\) and \(\mathbf {v}(\cdot )\) are vectorized features of \(\mathbf {x}\) in three feature spaces, each of size \(n^2 \times 1\). Intuitively, \(\mathbf {q(x)} \mathbf {k(x)}^T\) calculates the correlations between pixels within the image to find which areas are important, and the result is used to re-weight \(\mathbf {v(x)}\). In our implementation, we use \(3 \times 3\) convolutions with reduced feature channels to calculate \(\mathbf {q(x)}\) and \(\mathbf {k(x)}\), but employ a \(1 \times 1\) convolution to compute \(\mathbf {v(x)}\), balancing accuracy and efficiency.

Based on the obtained self-attention, the output of self-attention layer can be calculated by:

$$\begin{aligned} \mathbf {o} = \mathbf {h} \left( M \big (A(\mathbf {q,k,v}) \big ) \right) + \mathbf {x}, \end{aligned}$$
(3)

where \(M(\cdot )\) is the operation that converts \(A(\mathbf {q,k,v})\) back to an \(n \times n\) matrix and \(\mathbf {h}(\cdot )\) is a \(1 \times 1\) convolution.
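For concreteness, a PyTorch sketch of this attention layer following Eqs. 2 and 3 is given below. The channel reduction factor of 8 is an assumption (only the kernel sizes are specified above), and \(\mathbf {q}\)/\(\mathbf {k}\) keep a reduced channel dimension rather than being strictly \(n^2 \times 1\) vectors.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention2d(nn.Module):
    """Sketch of the attention layer of Eqs. 2-3: q/k from 3x3 convs with
    reduced channels, v from a 1x1 conv, plus a residual connection."""

    def __init__(self, c, reduction=8):   # reduction factor is an assumption
        super().__init__()
        self.q = nn.Conv2d(c, c // reduction, 3, padding=1)
        self.k = nn.Conv2d(c, c // reduction, 3, padding=1)
        self.v = nn.Conv2d(c, c, 1)
        self.h = nn.Conv2d(c, c, 1)       # h(.) in Eq. 3

    def forward(self, x):
        b, c, n1, n2 = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)   # b x n^2 x c'
        k = self.k(x).flatten(2)                   # b x c' x n^2
        v = self.v(x).flatten(2).transpose(1, 2)   # b x n^2 x c
        attn = F.softmax(q @ k, dim=-1)            # softmax(q k^T), Eq. 2
        o = (attn @ v).transpose(1, 2).reshape(b, c, n1, n2)  # M(.): reshape
        return self.h(o) + x                       # Eq. 3: residual output
```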

The structure of the proposed SR document image generator is demonstrated in Fig. 3.

Fig. 3.

The structure of the proposed SR document image generator \(\mathcal {G}_\theta \).

3.3 Quality Assessor \(\mathcal {D}_\xi \)

Under the GAN framework, the discriminator \(\mathcal {D}_\xi \) is designed to differentiate LR document images from HR document images, which guides the training of the image generator \(\mathcal {G}_\theta \) based on their visual quality. In this work, the quality assessor \(\mathcal {D}_\xi \) is built from 4 ResNet blocks, where each ResNet block is followed by a LeakyReLU layer and a max-pooling layer for down-sampling. The last three layers of the quality assessor \(\mathcal {D}_\xi \) are fully connected layers of size 1024, 512 and 1, respectively.
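A sketch of such a quality assessor under stated assumptions follows: the convolutional channel widths and the \(256 \times 256\) input size are illustrative, since only the block structure and the fully connected layer sizes are specified above.

```python
import torch.nn as nn

class ResBlock(nn.Module):
    """Minimal residual block; a 1x1 conv matches channels for the skip."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, padding=1), nn.BatchNorm2d(c_out),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(c_out, c_out, 3, padding=1), nn.BatchNorm2d(c_out))
        self.skip = nn.Conv2d(c_in, c_out, 1) if c_in != c_out else nn.Identity()

    def forward(self, x):
        return self.body(x) + self.skip(x)

class QualityAssessor(nn.Module):
    """Sketch of D_xi: 4 ResNet blocks, each followed by LeakyReLU and
    max-pooling, then FC layers of size 1024, 512 and 1."""

    def __init__(self, widths=(64, 128, 256, 512)):  # widths are assumptions
        super().__init__()
        blocks, c_in = [], 1
        for c in widths:
            blocks += [ResBlock(c_in, c),
                       nn.LeakyReLU(0.2, inplace=True),
                       nn.MaxPool2d(2)]
            c_in = c
        self.features = nn.Sequential(*blocks)
        self.head = nn.Sequential(      # assumes 256x256 inputs -> 16x16 maps
            nn.Flatten(),
            nn.Linear(widths[-1] * 16 * 16, 1024), nn.LeakyReLU(0.2, True),
            nn.Linear(1024, 512), nn.LeakyReLU(0.2, True),
            nn.Linear(512, 1))          # raw score; sigmoid applied in the loss

    def forward(self, x):               # x: b x 1 x 256 x 256
        return self.head(self.features(x))
```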

To train \(\mathcal {D}_\xi \), the document images produced by the proposed image generator \(\mathcal {G}_\theta \) and ideal HR images are fed into the discriminator, where the dissimilarity (loss) between these two types of images is calculated and subsequently used to train the image generator \(\mathcal {G}_\theta \).

3.4 Loss Functions for Training

Given an LR document image \(\mathbf {y}\) and its HR counterpart \(\mathbf {x}\), the goal of training the image generator \(\mathcal {G}_\theta \) is to find the optimal parameters \(\theta \) such that:

$$\begin{aligned} \bar{\theta }= \arg \min _\theta KL \left( \mathcal {G}_\theta (\mathbf {y}) \Vert \mathbf {x} \right) , \end{aligned}$$
(4)

where \(KL(\cdot \Vert \cdot )\) is the Kullback-Leibler divergence, which measures the dissimilarity between the super-resolved image \(\mathcal {G}_\theta (\mathbf {y})\) and the HR image \(\mathbf {x}\). Ideally, Eq. 4 is minimized when the SR document image \(\mathcal {G}_\theta (\mathbf {y})\) has the same distribution as the HR image \(\mathbf {x}\) w.r.t. resolution/quality. In the GAN setting, the Kullback-Leibler divergence between these two types of images is estimated by maximizing the objective function of the quality assessor \(\mathcal {D}_\xi \).

Although a GAN can learn the SR image generator \(\mathcal {G}_\theta \) from a large amount of unlabeled data, training can easily saturate because of the GAN’s large capacity and huge search space. Thus, in this work, we carry out a two-phase training scheme to train \(\mathcal {G}_\theta \). In the first training phase, we apply supervised training to obtain initial parameters \(\theta \) for \(\mathcal {G}_\theta \) that are close to the optimum, using a small amount of HR images along with their LR counterparts collected for this purpose. Unlike conventional approaches in which the mean squared error (MSE) is used as the loss function, we introduce the multiscale structural similarity (M-SSIM) loss to train the proposed model [38]. Given two image patches \(\mathbf {x}\) and \(\mathbf {y}\) of the same size, their M-SSIM is calculated by:

$$\begin{aligned} \text {M-SSIM}(\mathbf {x}, \mathbf {y}) = l(\mathbf {x}, \mathbf {y})^\alpha \cdot \prod _{j=1}^M c_j(\mathbf {x}, \mathbf {y})^{\beta _j} s_j(\mathbf {x}, \mathbf {y})^{\gamma _j}, \end{aligned}$$
(5)

where \(l(\cdot , \cdot )\) is the luminance measure, \(c_j(\cdot , \cdot )\) and \(s_j(\cdot , \cdot )\) are the contrast and structure measures at scale j, and the hyper-parameters \(\alpha , \beta _j\) and \(\gamma _j\) control the influence of these three measures. The detailed definitions of these measures can be found in [38].

Based on M-SSIM, we define the M-SSIM loss between two image patches \(\mathbf {x}\) and \(\mathbf {y}\) as:

$$\begin{aligned} \mathcal {L}_{\text {M-SSIM}} = \frac{1-\text {M-SSIM}(\mathbf {x}, \mathbf {y})}{2} \end{aligned}$$
(6)

As can be seen from Eq. 5, the advantage of the M-SSIM loss over the MSE loss is that MSE only measures absolute errors by comparing differences between individual pixels, whereas the M-SSIM loss focuses more on perceived quality by using local statistics of the two images.
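A compact sketch of this loss is given below. It makes two simplifying assumptions: a uniform window replaces the Gaussian window of [38], and the five-scale weights from [38] stand in for \(\alpha , \beta _j, \gamma _j\) (with the contrast and structure exponents merged, as is common).

```python
import torch
import torch.nn.functional as F

def _ssim_terms(x, y, win=11, C1=0.01 ** 2, C2=0.03 ** 2):
    # Local statistics via a uniform window (Gaussian in [38]);
    # images are assumed to be scaled to [0, 1].
    pad = win // 2
    mu_x, mu_y = (F.avg_pool2d(t, win, 1, pad) for t in (x, y))
    var_x = F.avg_pool2d(x * x, win, 1, pad) - mu_x ** 2
    var_y = F.avg_pool2d(y * y, win, 1, pad) - mu_y ** 2
    cov = F.avg_pool2d(x * y, win, 1, pad) - mu_x * mu_y
    l = (2 * mu_x * mu_y + C1) / (mu_x ** 2 + mu_y ** 2 + C1)  # luminance
    cs = (2 * cov + C2) / (var_x + var_y + C2)   # contrast * structure
    return l.mean(), cs.mean().clamp(min=1e-6)   # clamp keeps cs ** w real

def ms_ssim_loss(x, y, weights=(0.0448, 0.2856, 0.3001, 0.2363, 0.1333)):
    """(1 - M-SSIM(x, y)) / 2 as in Eq. 6; as in Eq. 5, luminance enters
    only at the coarsest scale while cs enters at every scale."""
    cs_terms = []
    for _ in range(len(weights) - 1):
        _, cs = _ssim_terms(x, y)
        cs_terms.append(cs)
        x, y = F.avg_pool2d(x, 2), F.avg_pool2d(y, 2)  # next coarser scale
    l, cs = _ssim_terms(x, y)
    ms_ssim = (l * cs) ** weights[-1]
    for w, c in zip(weights[:-1], cs_terms):
        ms_ssim = ms_ssim * c ** w
    return (1.0 - ms_ssim) / 2.0                       # Eq. 6
```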

In the second training phase, a large amount of unlabeled training samples is used with a combination of LS-GAN [19] and Ra-GAN [12] to train the generator \(\mathcal {G}_\theta \), where the losses for the generator \(\mathcal {G}_\theta \) and the discriminator \(\mathcal {D}_\xi \) are defined by:

$$\begin{aligned} \min \mathcal {L}_{\mathcal {G}_\theta }&= \frac{1}{2} \mathbb {E}_{\mathbf {y} \sim \mathbb {P}(\mathbf {y})} \left[ \Big ( \delta \big (\mathcal {D}_\xi (\mathcal {G}_\theta (\mathbf {y})) - \mathbb {E}_{\mathbf {x} \sim \mathbb {P}(\mathbf {x})} [\mathcal {D}_\xi (\mathbf {x})] \big ) - a \Big )^2 \right] \end{aligned}$$
(7)
$$\begin{aligned} \min \mathcal {L}_{\mathcal {D}_\xi }&= \frac{1}{2} \mathbb {E}_{\mathbf {x} \sim \mathbb {P}(\mathbf {x})} \left[ \Big (\delta \big (\mathcal {D}_\xi (\mathbf {x}) - \mathbb {E}_{\mathbf {y} \sim \mathbb {P}(\mathbf {y})}[\mathcal {D}_\xi (\mathcal {G}_\theta (\mathbf {y}))] \big ) - b \Big )^2 \right] \nonumber \\&+ \frac{1}{2} \mathbb {E}_{\mathbf {y} \sim \mathbb {P}(\mathbf {y})} \left[ \Big ( \delta \big (\mathcal {D}_\xi (\mathcal {G}_\theta (\mathbf {y})) - \mathbb {E}_{\mathbf {x} \sim \mathbb {P}(\mathbf {x})}[\mathcal {D}_\xi (\mathbf {x})] \big )- c \Big )^2 \right] \end{aligned}$$
(8)

where \(\mathbb {E}[\cdot ]\) denotes expectation, \(\delta (\cdot )\) is the sigmoid function, \(\mathbf {x}\) is an HR training image and \(\mathbf {y}\) is an LR training sample; we take \(a=1, b=1\) and \(c=0\) in our work.

As we can see from Eq. 7 and Eq. 8, unlike conventional GAN losses, Ra-GAN uses the relative average difference between the discriminator outputs for HR and LR training samples within a training batch, which provides a more stable training process. The use of LS-GAN further helps the proposed system avoid saturation during the GAN training phase.
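Given the discriminator scores for a batch of HR and generated images, these two losses reduce to a few lines. The sketch below assumes the usual GAN practice of computing the generator and discriminator updates in separate forward passes, with the generator output detached for the discriminator update.

```python
import torch

def ra_lsgan_losses(d_real, d_fake, a=1.0, b=1.0, c=0.0):
    """Relativistic-average LS-GAN losses of Eqs. 7-8.

    d_real: discriminator scores D(x) for a batch of HR images.
    d_fake: discriminator scores D(G(y)) for a batch of generated images
            (detach G(y) before scoring when optimizing the discriminator).
    """
    sig = torch.sigmoid                       # delta(.) in Eqs. 7-8
    rel_fake = sig(d_fake - d_real.mean())    # D(G(y)) relative to avg D(x)
    rel_real = sig(d_real - d_fake.mean())    # D(x) relative to avg D(G(y))
    loss_g = 0.5 * ((rel_fake - a) ** 2).mean()                      # Eq. 7
    loss_d = (0.5 * ((rel_real - b) ** 2).mean()
              + 0.5 * ((rel_fake - c) ** 2).mean())                  # Eq. 8
    return loss_g, loss_d
```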

4 Experiments and Results

4.1 Datasets and Evaluation Metrics

To train the proposed super-resolution document image generator, the SmartDoc-QA dataset [23] and the B-MOD dataset [14] were employed.

The SmartDoc-QA dataset contains a total of 4260 camera-captured document images of size \(3096 \times 4128\). All images were obtained from 30 noise-free documents using two types of mobile phone cameras, with different types of distortions and degradations such as uneven lighting and out-of-focus blur. In our experiments, we selected 90% of the images in this set for supervised training in the first training phase, and the remaining 10% were used for tuning.

Prior to the supervised training, the captured images of the SmartDoc-QA dataset were rectified using a RANSAC based perspective transformation, where the matching points between each LR image and the corresponding HR image were extracted with SURF features.
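For illustration, a sketch of this rectification step with OpenCV follows. The SURF Hessian threshold and the RANSAC reprojection tolerance are assumptions, and SURF itself requires the non-free opencv-contrib build (ORB would be a patent-free drop-in substitute).

```python
import cv2
import numpy as np

def rectify(lr_img, hr_img):
    """Match SURF keypoints between the captured LR image and its HR source,
    fit a homography with RANSAC, and warp the capture onto the HR frame."""
    surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)  # needs contrib
    kp1, des1 = surf.detectAndCompute(lr_img, None)
    kp2, des2 = surf.detectAndCompute(hr_img, None)
    matches = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True).match(des1, des2)
    src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, ransacReprojThreshold=5.0)
    h, w = hr_img.shape[:2]
    return cv2.warpPerspective(lr_img, H, (w, h))
```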

The document images in the B-MOD dataset were captured in the same manner as in the SmartDoc-QA dataset but with a larger number of samples: 2,113 unique pages from random scientific papers were collected and photographed, resulting in a total of 19,725 images obtained with 23 different mobile devices. Although rectified document images are available in this dataset, the correspondences between these rectified images and their HR counterparts are not provided. Thus, in our experiments, we used the HR and LR images in this dataset for the unsupervised training of the GAN.

Because there is no public benchmark dataset available for the document image super-resolution task, we used the SoC dataset [15] to evaluate the performance of the proposed system. In the SoC dataset, a total of 175 document images were captured from 25 “ideally clean” documents with different focal lengths. For each image, the ground-truth text is also provided.

In Fig. 4, sample images from these three datasets are shown.

Fig. 4.

Sample images from the training and evaluation datasets. (a) Rectified sample image from the SmartDoc-QA dataset. (b) Rectified training image from the B-MOD dataset, where the markers around the image are used for rectification. (c) Raw image from the SoC dataset for evaluation.

To measure the performance of image super-resolution, the peak signal-to-noise ratio (PSNR) is a widely used metric, where a higher value means better performance. However, PSNR only considers the differences between pixel values at the same positions and ignores human visual perception, which makes it a poor indicator of SR image quality. Furthermore, since we are primarily interested in improving OCR performance through SR, and expect the resolved images not only to have higher resolution but also to contain less noise than the original LR images, we use the character error rate (CER) of the OCR outputs to evaluate the proposed document image super-resolution system.
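Concretely, CER is the character-level edit (Levenshtein) distance between the OCR output and the ground-truth text, normalized by the length of the ground truth; a minimal self-contained implementation is sketched below.

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance between the OCR output
    (hypothesis, e.g. from pytesseract.image_to_string) and the ground-truth
    text (reference), normalized by the reference length."""
    m, n = len(reference), len(hypothesis)
    prev = list(range(n + 1))            # DP row for the empty reference
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            sub = prev[j - 1] + (reference[i - 1] != hypothesis[j - 1])
            cur[j] = min(prev[j] + 1,    # deletion
                         cur[j - 1] + 1, # insertion
                         sub)            # substitution (or match)
        prev = cur
    return prev[n] / max(m, 1)
```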

4.2 System Training

As described in Sect. 3.2 and Sect. 3.3, we train the proposed super-resolution generator in two training phases.

In the first training phase, supervised training was carried out using the training data from the SmartDoc-QA dataset. To train the generator \(\mathcal {G}_\theta \), we randomly cropped small patches from the rectified camera-captured document images and fed them into the generator. Patches at the same locations in the corresponding noise-free images were also cropped and used as ground truths to guide the training of the generator \(\mathcal {G}_\theta \). In our experiments, the patch size was \(256 \times 256\).

In the second training phase, the rectified camera-captured images from the B-MOD dataset were used as the LR training samples, and the 2,113 clean images from this dataset, along with 3,554 additional in-house noise-free document images, were utilized as the HR training samples. These unlabeled training images were randomly cropped and fed into the proposed system to train the GAN, and the relativistic average losses were used to guide the training of the generator \(\mathcal {G}_\theta \) and the discriminator \(\mathcal {D}_\xi \).

In these experiments, the Adam optimizer was used in both the supervised and unsupervised training phases with an initial learning rate of \(10^{-3}\). The training was stopped when the improvement of the training loss fell below a threshold.

4.3 Evaluation of Generator \(\mathcal {G}_\theta \)

To evaluate the proposed document image super-resolution system with different losses and neural network architectures, we trained multiple generators \(\mathcal {G}_\theta \) with different settings. The baseline system was a super-resolution document image generator trained with supervised training only, using the MSE loss and with the attention layers removed from the generator. A second generator with the same architecture but the M-SSIM loss was also trained with the supervised strategy. For the third SR document image generator, we added the attention layers to the network as shown in Fig. 3. Based on the third generator, we retrained it in the GAN framework to obtain the final SR document image generator.

To assess the performance of these systems, the 175 images from the SoC dataset were used. Considering these degraded images as enlarged versions of their LR counterparts, we used them directly as inputs to the SR image generator \(\mathcal {G}_\theta \) to produce the SR document images. The Tesseract OCR engine was then used to transcribe the obtained SR images, and the CERs were calculated. In Table 1, the CERs for the original images and for the images generated by the SR systems we implemented are listed. It can be observed that the CER for the original degraded document images is as high as \(22.03\%\). Using the CNN based SR image generator, the CER of the produced document images decreases to \(17.00\%\). With the introduced M-SSIM loss and attention layers, the CER of the restored images decreases further to \(14.95\%\). From the last row of the table, it can be seen that the images generated by the SR system trained with the GAN achieve the lowest CER of \(14.40\%\), which shows the effectiveness of the proposed M-SSIM loss, attention layers and GAN training scheme for the SR system.

Table 1. OCR performance for the different SR document image generators trained in this work, where the original image is used as the system input.

4.4 Comparison with the State-of-the-Art

To compare the OCR performance of the proposed SR document image generation system with other state-of-the-art SR approaches, a comparison experiment was conducted in which the images in the SoC dataset were first downsampled by a factor of 4. Then, these down-sized images were super-resolved using bicubic interpolation, SRCNN [6], SRGAN [17], and the proposed approach, respectively. In this experiment, we retrained SRGAN using the same training data from the SmartDoc-QA and B-MOD datasets. Note that because the proposed SR document image generator \(\mathcal {G}_\theta \) takes inputs of the same size as its outputs, we upsampled the down-sized images with nearest-neighbor interpolation before sending them to the proposed generator. The obtained SR document images were transcribed with the Tesseract OCR engine, and the CERs are reported in Table 2.

Table 2. Comparison between the proposed SR approach and the state-of-the-art methods w.r.t. OCR performance, where the original image is downsampled by a factor of 4 before being used as the system input.

As can be seen from this table, for document images down-sampled by a factor of 4, the resolved images obtained by the proposed method achieve almost the same OCR accuracy as the original images. In contrast, the images produced by the other methods, even though visually close to the original images, still have worse OCR performance.

Fig. 5.

Output SR images by different approaches.

In Fig. 5, the original document image and the SR images restored by the different approaches are illustrated. We can observe from this figure that, unlike the SR images generated by the other methods, which still contain plenty of background noise and uneven/bad lighting effects, the image produced by the proposed SR method removes the lighting effects and provides a clean background. The main reason for this advantage is that the proposed SR system uses a GAN to train the generator \(\mathcal {G}_\theta \), which does not rely on paired HR/LR training images. We can therefore collect a large amount of ideally noise-free documents as targets to guide the training of the discriminator and the generator, enabling the obtained generator to produce high quality images with less noise.

5 Conclusion

In this work, we propose an end-to-end neural network for SR document image generation, where the image generator is composed of an encoder and a decoder connected by the proposed attention layers. To obtain the SR document generator, a two-phase training strategy is implemented: in the first phase, supervised training is carried out with the designed M-SSIM loss, which effectively improves SR performance w.r.t. OCR accuracy. In the second training phase, a combination of LS-GAN and Ra-GAN is applied, which largely relaxes the restrictions on the training samples, so that a large number of unlabeled training samples, including ideally noise-free images, can be collected to train the proposed SR system. The experimental results demonstrate the effectiveness of the proposed SR method based on OCR performance. In the future, we will explore other loss functions, including OCR based loss functions, to guide the training of the SR image generator.