1 Introduction

Documents are a common way to share information and record transactions between people. To digitize large volumes of printed documents, the hard copies are scanned and the text is extracted automatically by Optical Character Recognition (OCR) systems, such as [11, 12]. In the past, most documents were scanned on flatbed scanners. However, the past few years have seen a rise in the use of smartphones, and with it the use of the smartphone camera as a document scanner. Camera-captured documents such as receipts are often folded, curved, or crumpled, and vary greatly in camera angle, lighting, and texture. This makes the OCR task much more challenging than it is for scanned images.

Recent OCR methods have had great success in recognizing text in very challenging scenarios. One example is scene text recognition [1, 18], which aims to recognize text in natural images; the text is often sparse, and may also be rotated or curved. Another scenario is retrieving the content of a document with dense text, which poses the challenge of detecting and recognizing many closely spaced words.

Fig. 1.

Overview of CREASE. CREASE is a document rectification method that learns content-based signals in addition to the document’s 3D structure in order to estimate a transformation used for rectifying the image. On the left is a synthetically generated input image, and on the right is the image rectified using the transformation that CREASE predicted. The bottom images are the supervision signals, from left to right: the 3D coordinates of the warped document, the angle deformation map, the curvature map, and the text mask.

While recognizing dense text, and similarly detecting sparse curved text, have been studied thoroughly, the combined problem of detecting and recognizing text that is both dense and warped has received significantly less attention. Many text detectors assume axis-aligned text and struggle with deformed lines [8, 26], while text recognition systems struggle with fine deformations at the character level. Taking this into account, a line of work has proposed rectifying the document as a pre-processing step before recognition. Recent methods harnessed the power of deep learning to solve this task [6, 13], but placed more emphasis on the page boundaries and less on the content of the document.

In this paper, we present CREASE: Content Aware Rectification using Angle Supervision. CREASE performs document rectification by relying on both global and local hints, with an emphasis on content. Our method predicts the 3D structure globally while simultaneously optimizing for local structure: the text orientation, the location of folds and creases, and the output backward map. CREASE provides results that are superior in readability, visual similarity, and geometric reconstruction.

CREASE predicts the mapping of a warped document image to its “flatbed” version. First, we estimate the 3D structure of the input document. Then, we transform this estimate into a mapping, specifically a backward mapping. Finally, the mapping is used to resample the warped image into its flattened form. A general overview of CREASE is given in Fig. 1. Our contributions are as follows:

  1. We present a per-pixel angle regression loss that complements the 3D structure estimation by optimizing different aspects of the rectification process.

  2. We present a curvature estimation task, which predicts the lines along which the document is crumpled or folded, complementing the per-pixel angle regression loss by emphasizing its discontinuities.

  3. The losses are learned as side tasks, focusing on the areas of the document that contain strong signals regarding the text orientation, and are optimized alongside the 3D structure estimation in an end-to-end optimization process.

  4. We reduce the relative OCR error on a challenging warped document dataset by \(20.2 \%\) and the relative geometric error by \(14.1 \%\), compared to the state-of-the-art method.

We train CREASE using synthetic data, which provides us with intricate details regarding each document in our training set without requiring manual annotation: the ground-truth transformation, 3D coordinates, angle and curvature values for every pixel, and the text segmentation mask.

We present visual and quantitative results and comparisons on both synthetic and real evaluation datasets. We also present a detailed study of the contribution of each individual model component.

2 Background

Many works have addressed the problem of extracting text from documents captured in challenging scenarios. These works have focused on different elements that improve OCR accuracy, such as illumination and noise correction [2, 16], resolution enhancement [27], and document rectification [6, 7, 13]. This work focuses on the last of these: rectifying a warped document from a single image by predicting its 3D model.

Early document rectification methods used hand-crafted features to detect the structure of a document. These methods usually made strong assumptions about the deformation process, such as smoothness [4, 10] or folding structure [7]. Several works utilized special equipment to capture the 3D model of the document [3], or reconstructed the 3D model from multi-view images [25].

More recently, approaches such as [6, 13, 16] have used deep learning to rectify single-image, camera-captured documents, and were designed to solve the rectification problem by directly predicting the document warp. These works placed the focus on the aforementioned warp, and gave no special treatment to the content of the document, i.e., the data we wish to ultimately recognize.

The first in this line of works is DocUNet [13], which used a stacked hourglass architecture to predict the original 2D coordinates of each pixel in the warped document. This prediction, in essence, gives the forward mapping from the rectified image to the warped one, which can be inverted to obtain the final result.

A follow-up work presented DewarpNet [6], which added three learned post-processing components for calculating the backward map, the surface normals, and the shading, each as a separate hourglass network. Additionally, the original 2D forward map of [13] was replaced by a prediction of its 3D counterpart, and the stacked hourglass was substituted by a single hourglass network for this prediction. This method mostly relied on the document boundary, and did not explicitly address the document’s content in the rectification module.

Another work, by Li et al. [16], focused mainly on uneven background illumination; however, it too did so by predicting the document warp. This work computed a forward map in a manner similar to [13], but divided the prediction into three phases. First, a local, patch-based network predicted the gradients of the forward map. Then, a graph-cut model stitched these patch predictions into a global warp. Finally, the un-warped image underwent an illumination correction. The local and global level prediction allowed the warp estimation to take into account both the document boundaries and areas in the center of the document, but this pipeline required an expensive patch-stitching process and worked best on input documents with minimal background. Additionally, this method was not end-to-end trainable and did not take the document content into account.

One of the key points in our approach is the importance of predicting text orientation in the warped image. The notion of text angle prediction was previously explored in scene text recognition at the word (or, object) level [17, 19, 28], as opposed to our pixel-level approach.

The EAST text detector by Zhou et al. [28] and the FOTS detector by Liu et al. [17] both predicted the angle for each word detection candidate in conjunction with other parameters, such as bounding box size and quadrangle coordinates. Ma et al. [19] extended the Faster-RCNN [20] architecture by adding rotated anchors to accommodate arbitrarily oriented text. It is important to stress that scene text methods deal with a sparse set of words, and moreover, each word is rectified separately. Documents, on the other hand, benefit more from a rectification process before word localization, due to both the denser text and a stronger prior on its structure. To make use of this prior, CREASE applies an angle regression loss at the pixel level, focused on the salient text areas in the document, and optimized in an end-to-end manner over the predicted backward map.

3 Method

We design CREASE to exploit a document’s content and geometry on both local and global levels. CREASE addresses different aspects of the input, such as global structure, creases and fold lines, and per-pixel angular deformation. This allows it to capture an accurate mapping for the entire document globally, and for fine-grained features such as characters and words locally (without detecting them explicitly).

First, we present the general architecture of our model (Subsect. 3.1). Next, we present the properties of documents that CREASE relies on: the flow field angles (Subsect. 3.2), and the curvature estimation (Subsect. 3.3). Finally, we present the optimization objective tying the various signals together (Subsect. 3.4).

Fig. 2.

Architecture of CREASE. The rectification process (orange arrows) consists of two steps: a 3D estimator that predicts the 3D coordinates of the document in the image, and a backward map estimator that infers the backward map from the 3D estimate. The input image is rectified using the backward map. Red and blue frames denote ground truth supervisions and predictions, respectively. Black arrows denote the training process and the losses used for optimization. Training is performed first on the 3D estimation model, which is then fine-tuned in an end-to-end fashion. (Color figure online)

3.1 Architecture

CREASE is composed of a two-stage network, illustrated in Fig. 2. The first stage estimates the location of each pixel in a normalized 3D coordinate system, the warp field angle values, and the curvature at each pixel. The 3D estimation module is followed by a backward mapping network. This network transforms the estimated 3D coordinate image into a backward map that can be used for rectifying the input image.

3D Estimation. The first stage provides per-pixel estimates of the 3D coordinates (based on [6]) along with the angle and curvature outputs, which are used as side-tasks. A U-Net [21] based architecture maps an input image into the angle, curvature, and 3D coordinate maps. All three maps are used for supervision, and the 3D coordinate map also serves as the input to the backward mapping stage.

Backward Mapping. The second network stage transforms the 3D coordinate map output by the first stage into the backward mapping of the image. This mapping describes the transformation from the warped image to the rectified result. In other words, the backward map determines, for every pixel in the rectified (output) image domain, its location in the input image.
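To make this concrete, the following is a minimal sketch of how such a backward map can be applied to rectify an image; the normalization convention and function names are our assumptions, not the exact implementation used by CREASE.

```python
# Minimal sketch, assuming the backward map is given in grid_sample's
# normalized [-1, 1] coordinate convention; names here are illustrative.
import torch
import torch.nn.functional as F

def rectify(image: torch.Tensor, backward_map: torch.Tensor) -> torch.Tensor:
    """image: (N, 3, H, W) warped input.
    backward_map: (N, 2, H, W); for every rectified pixel, its (x, y) source
    location in the warped input image."""
    grid = backward_map.permute(0, 2, 3, 1)  # (N, H, W, 2), as grid_sample expects
    return F.grid_sample(image, grid, mode="bilinear", align_corners=True)
```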

The authors of [13] used a straightforward implementation for deducing the backward map: inverting their UV forward map prediction. This inversion is done by ‘placing’ the pixels of the forward map in the rectified image based on their values, and performing interpolation over the resulting non-regular grid. This inversion is very sensitive to noise in the forward map, e.g., if two neighboring pixels swap their predictions. The same task was addressed in [16] using a parallel iterative method that is not applicable in an end-to-end differentiable solution.
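For illustration only, a naive scatter-and-interpolate inversion of this kind might look as follows (this is our own sketch of the idea described above, not the authors’ code).

```python
# Sketch of naive forward-map inversion: scatter forward-map samples into the
# rectified domain and interpolate them back onto a regular grid.
import numpy as np
from scipy.interpolate import griddata

def invert_forward_map(fwd_uv: np.ndarray, out_h: int, out_w: int) -> np.ndarray:
    """fwd_uv: (H, W, 2) forward map; for every warped pixel, its (u, v)
    position in the rectified image, normalized to [0, 1]."""
    h, w, _ = fwd_uv.shape
    ys, xs = np.mgrid[0:h, 0:w]
    points = fwd_uv.reshape(-1, 2) * np.array([out_w - 1, out_h - 1])  # scattered targets
    grid_x, grid_y = np.meshgrid(np.arange(out_w), np.arange(out_h))
    bwd_x = griddata(points, xs.ravel().astype(np.float64), (grid_x, grid_y), method="linear")
    bwd_y = griddata(points, ys.ravel().astype(np.float64), (grid_x, grid_y), method="linear")
    return np.stack([bwd_x, bwd_y], axis=-1)  # (out_h, out_w, 2) backward map
```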

For the backward mapper we rely on the DenseNet [9] based model provided by [6], which learned this transformation in a manner independent of the input texture; we use it to transform our 3D coordinate maps into backward maps. The use of a differentiable backward mapper allows end-to-end training, optimizing the 3D estimation and backward mapping networks jointly with a combined objective, as discussed in Subsect. 3.4.

3.2 Angle Supervision

While the 3D estimation network learns the global document structure, we also want the network to be aware of the local angular deformation that each point of the document has undergone during the warping. Angular deformation estimation complements the 3D regression used in [6] because it is more sensitive to small deformations that might warp parts of words. To calculate this value, we warp a local Cartesian system from the source to the target image. We use angular deformation estimation in two places in our framework.

Angle from Backward Map. The first place we estimate the angle is the backward map, where at each pixel we create two infinitesimal vectors \(\varepsilon _x\) and \(\varepsilon _y\), directed along the x and y directions, respectively. We then measure the rotation that \(\varepsilon _x\) and \(\varepsilon _y\) undergo due to the warping process, and denote the resulting angles as \(\theta _x\) and \(\theta _y\). This process is illustrated in Fig. 3. These angles capture the rotation and shear parts of a local affine transform, without the translation and scale counterparts, which are captured by the coordinate regression.
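A minimal sketch of this measurement, using finite differences of the backward map as a stand-in for the infinitesimal vectors (our own formulation; the angle reference conventions are assumptions):

```python
# Sketch: per-pixel rotation of the x- and y-axis vectors under a backward map.
import math
import torch

def warp_angles(bwd: torch.Tensor):
    """bwd: (2, H, W) backward map holding (x, y) source coordinates."""
    dx = bwd[:, :, 1:] - bwd[:, :, :-1]   # image of epsilon_x (finite difference along x)
    dy = bwd[:, 1:, :] - bwd[:, :-1, :]   # image of epsilon_y (finite difference along y)
    theta_x = torch.atan2(dx[1], dx[0])                    # rotation of the x-axis vector
    theta_y = torch.atan2(dy[1], dy[0]) - math.pi / 2.0    # rotation of the y-axis vector
    return theta_x, theta_y
```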

Fig. 3.

Per-Pixel Angle Calculation. Illustration of the local warp angle calculation at three locations on the document. The green point is enlarged to provide better detail, showing the resulting angles per axis, denoted by \(\theta _x\) and \(\theta _y\). This process is used to penalize angle prediction errors in the backward map. A similar process is done for the forward map, by selecting points from the warped image and transferring them using \(\pmb {\tau }\). (Color figure online)

Auxiliary Angle Prediction. In addition to deriving the angles from the backward map, we predict them directly from the 3D estimator network as two auxiliary prediction maps. These are learned in parallel with the 3D coordinate prediction, as shown in Fig. 2, to better guide the training, and are not used at test time. Specifically, each of the two angles \(\theta _x\) and \(\theta _y\) is derived from its own pair of channels, followed by a Cartesian-to-polar conversion (see supplementary material for details). This conversion yields, in addition to the angles \(\theta _x\) and \(\theta _y\), corresponding magnitude values denoted \(\rho _x\) and \(\rho _y\). We use the magnitude values as ‘angle-confidence’ to weight the angle loss proportionally. This is beneficial since predictions that have small magnitude are more sensitive to small perturbations.
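As an illustration, the Cartesian-to-polar conversion could be realized as follows; the channel layout and names are our assumptions.

```python
# Sketch: convert two 2-channel heads into angles and confidence magnitudes.
import torch

def to_polar(pred: torch.Tensor):
    """pred: (N, 4, H, W); channels (0, 1) encode theta_x, channels (2, 3) encode theta_y."""
    theta_x = torch.atan2(pred[:, 1], pred[:, 0])
    rho_x = torch.sqrt(pred[:, 0] ** 2 + pred[:, 1] ** 2)  # 'angle-confidence' for theta_x
    theta_y = torch.atan2(pred[:, 3], pred[:, 2])
    rho_y = torch.sqrt(pred[:, 2] ** 2 + pred[:, 3] ** 2)  # 'angle-confidence' for theta_y
    return theta_x, rho_x, theta_y, rho_y
```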

Angle Estimation Loss. We employ a per-pixel angle penalty on the two aforementioned predictions: one as derived from the backward map, and the other as an auxiliary prediction. The per-pixel prediction provided by the 3D estimation network is masked by a binary text segmentation map. Often, the strongest deformations appear around the borders, and far from content. Masking content-less areas allows the loss to target areas of interest, and avoid bias towards the highly deformed boundaries. Our loss minimizes the smallest angle (modulo \(2\pi \)) between each of the predicted angles \(\{\theta _x, \theta _y\}\) and their ground truth counterparts \(\{\hat{\theta }_x, \hat{\theta }_y\}\). The per-pixel loss for angles is therefore:

$$\begin{aligned} L_{angle}(\pmb {\theta }, \pmb {\hat{\theta }}, \pmb {\hat{\rho }}) = \sum _{i \in \{x, y\}} \hat{\rho }_i \odot ( \Vert \theta _i - \hat{\theta }_i\Vert - \pi ) \mod 2\pi , \end{aligned}$$
(1)

where \(\odot \) denotes the Hadamard product. For the backward-map angles, the loss is applied without the confidence values \(\hat{\rho }\), which are set to \(\pmb {1}\), since these angles are derived from the backward map rather than predicted as an auxiliary output.
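A hedged sketch of this penalty, implementing the smallest-angle distance described above (the exact wrapping convention of Eq. (1) may differ from ours):

```python
# Sketch: per-pixel wrapped angular error, weighted by the confidence rho_hat.
import math
import torch

def angle_loss(theta: torch.Tensor, theta_gt: torch.Tensor, rho_gt: torch.Tensor = None):
    """All tensors share the same spatial shape; call once per axis (x or y)."""
    diff = torch.remainder(theta - theta_gt + math.pi, 2 * math.pi) - math.pi  # wrap to (-pi, pi]
    penalty = diff.abs()
    if rho_gt is None:
        rho_gt = torch.ones_like(penalty)  # backward-map case: confidence fixed to 1
    return rho_gt * penalty  # per-pixel; text masking and averaging happen outside
```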

3.3 Curvature Estimation

A key observation we utilize in this work is that the surface of a crumpled document behaves in a manner similar to a piecewise-planar surface. Each interface between two approximately planar sections introduces a region of higher curvature and higher local distortion. We wish to give the network a supervision signal that indicates the presence of such high curvature.

Intuitively, the more crumpled the paper, the more creases or discontinuities the warp function exhibits. The curvature map highlights non-planar areas of the paper, where 3D and angle regression might be less accurate. A point in the middle of a plane would have zero curvature, while a point at the tip of a needle would have the maximal curvature value. To generate this signal, we utilize the 3D mesh used to generate each document image.

Formally, for a paper mesh \(\mathcal {M}\) we calculate a curvature map \(H(\mathcal {M})\) using the Laplace-Beltrami operator, as defined for meshes in  [23]. The mean curvature per mesh vertex \(\mathbf {v}_i \in \mathbb {R}^{3}\) is obtained by:

$$\begin{aligned} H(\mathcal {M})_{i} = ||\sum _{j \in N_{i}} {(\mathbf {v}_i - \mathbf {v}_j)} ||_{2}. \end{aligned}$$
(2)

The supervision maps are created by thresholding the curvature, suppressing noise and slight perturbations while emphasizing the actual lines that define the global deformation of the paper. These maps are predicted as an additional segmentation mask by the 3D estimation network.
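As an illustration, the curvature supervision could be generated as follows, assuming the mesh connectivity (the neighbor sets \(N_i\)) is available from the synthetic renderer; the threshold value is a hypothetical placeholder.

```python
# Sketch: uniform (umbrella) Laplace-Beltrami curvature per vertex, then a
# thresholded binary supervision mask, following Eq. (2).
import numpy as np

def curvature_mask(vertices: np.ndarray, neighbors: list, thresh: float = 0.01) -> np.ndarray:
    """vertices: (V, 3) mesh vertex positions; neighbors[i]: indices adjacent to vertex i."""
    curvature = np.zeros(len(vertices))
    for i, nbrs in enumerate(neighbors):
        curvature[i] = np.linalg.norm(np.sum(vertices[i] - vertices[nbrs], axis=0))
    return (curvature > thresh).astype(np.float32)  # binarized curvature supervision
```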

3.4 Optimization

The optimization of our model consists of two stages: an initial training stage for the 3D estimation network using the side-tasks, followed by an end-to-end fine-tuning stage in which the network is optimized w.r.t. a combined loss term.

3D Estimation Model. Initially, we optimize the 3D estimation model using a loss objective that includes the 3D coordinate estimation loss and the aforementioned auxiliary losses, as described in Eq. (3). We denote the predicted and ground-truth normalized world coordinates by \(\mathbf {C}\) and \(\mathbf {\hat{C}}\). The first loss term is the \(L_1\) loss over the coordinate error, similar to [6]. The second term is the angle loss presented in Eq. (1), masked by the binary text segmentation mask \(\mathbf {\hat{D}}\) and averaged over all text-containing pixels. The last term is the curvature estimation \(L_2\) loss. The 3D estimation loss is:

$$\begin{aligned} L_{3D} = ||\mathbf {C} - \mathbf {\hat{C}}||_1 + \mathbf {\hat{D}} \odot L_{angle} + ||\mathbf {H} - \mathbf {\hat{H}}||_2. \end{aligned}$$
(3)

End-to-End Fine-Tuning. The first stage of our model may either be trained individually, or as part of an end-to-end architecture. When training the model end-to-end, the backward map is inferred and used to penalize the predicted 3D coordinates through the final result. We penalize the predicted backward map \(\hat{B}\) with the \(L_1\) loss, as was done in [6], and additionally with our angle loss from Eq. (1). We add these penalty terms to the one in Eq. (3), resulting in the following combined end-to-end loss:

$$\begin{aligned} L_{combined} = L_{3D} + ||\mathbf {B} - \mathbf {\hat{B}} ||_1 + L_{angle}. \end{aligned}$$
(4)
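The composition of the two objectives can be summarized in code as follows; this is a sketch under our own naming and reduction conventions (e.g., mean reductions in place of norms), not the released training code.

```python
# Sketch: composing Eq. (3) and Eq. (4); all inputs are torch tensors.
def loss_3d(C, C_gt, angle_terms, text_mask, H_pred, H_gt):
    """Eq. (3): coordinate L1 + text-masked angle loss + curvature L2."""
    l_coord = (C - C_gt).abs().mean()
    l_angle = (text_mask * angle_terms).sum() / text_mask.sum().clamp(min=1)
    l_curv = ((H_pred - H_gt) ** 2).mean()
    return l_coord + l_angle + l_curv

def loss_combined(l_3d, B_pred, B_gt, bwd_angle_terms):
    """Eq. (4): add the backward-map L1 and the backward-map angle loss."""
    return l_3d + (B_pred - B_gt).abs().mean() + bwd_angle_terms.mean()
```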

4 Experiments

We evaluate CREASE on a new evaluation set comprised of 50 high-resolution synthetic images, as well as on real images from the evaluation set proposed by [13]. The synthetic dataset is generated with both warp and text annotations, useful for OCR-based evaluations and for evaluating the individual stages of our model. We provide geometric, visual, and OCR-based metrics, as well as qualitative evaluations. We compare our results to DewarpNet [6] trained on our training set, using the code and parameters published by the authors. As the method of Li et al. [16] is not directly applicable to the task at hand, it is not evaluated in this section. Discussion and comparison to [16] are provided in the supplementary material.

All models were trained on 15,000 high-resolution images rendered using an extension of the rendering pipeline provided by [6]. Our extensions include the generation of our supervision signals: text, curvature, and angles, in addition to the 3D coordinates provided by the original rendering pipeline. These added signals come at negligible cost and have no effect at test time. Further details regarding dataset generation are provided in the supplementary material.

4.1 Evaluation Metrics

OCR Based Metric. To correctly evaluate any word-related metric, we must first obtain a set of aligned word location pairs, i.e., a matching polygon for each ground-truth bounding box in the predicted rectified image domain. Given the density and small scale of words in documents, a naive coordinate matching scheme is likely to fail, as a small global shift is to be expected even in the best-case scenario.

During evaluation, we rectify an input image twice: using the network’s predicted backward map, and using the ground-truth map. We then use an OCR engine for extracting words and bounding boxes from the rectified images.

To properly match bounding boxes, we perform the matching stage in the input image domain, visualized in Fig. 4. Each bounding box extracted from a rectified image is warped back and becomes a polygon in the input (warped) image domain.

We use polygon intersection as the matching score and match pairs using the Hungarian algorithm [14]. With the paired prediction and ground-truth word boxes we can evaluate the Levenshtein distance [15], or edit distance, denoted by \(E_d\). We first calculate the edit distance for each word in each document, then average the edit distance over all the words in the dataset.
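For illustration, the matching and scoring step might be implemented as follows; the library choices (shapely, scipy, python-Levenshtein) and function names are our assumptions, not necessarily the authors’ tooling.

```python
# Sketch: match predicted and ground-truth word polygons by intersection area
# using the Hungarian algorithm, then average the edit distance of matched pairs.
import numpy as np
from scipy.optimize import linear_sum_assignment
from shapely.geometry import Polygon
import Levenshtein

def match_and_score(pred_words, gt_words):
    """Each list holds (polygon_points, text) pairs, given in the input-image domain."""
    cost = np.zeros((len(gt_words), len(pred_words)))
    for i, (gt_poly, _) in enumerate(gt_words):
        for j, (pr_poly, _) in enumerate(pred_words):
            cost[i, j] = -Polygon(gt_poly).intersection(Polygon(pr_poly)).area  # maximize overlap
    rows, cols = linear_sum_assignment(cost)
    dists = [Levenshtein.distance(gt_words[i][1], pred_words[j][1])
             for i, j in zip(rows, cols) if cost[i, j] < 0]  # keep only overlapping pairs
    return float(np.mean(dists)) if dists else 0.0
```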

Following [6], we use an off-the-shelf OCR engine (Tesseract 4.0  [22]). This engine is quite basic, and does not reflect the advances and robustness of more modern OCR models. However, the vast majority of recent OCR methods are targeted at scene-text, with the number of proposed text instance detections often limited to 100-200. Thus, they are not suited to handle dense document text. As an alternative, there are a few commercial products designed to handle dense text recognition [11, 12] that are far more advanced than Tesseract. We choose one of them, [11], for an additional evaluation. Results are presented in Tables 1 and 2.

Fig. 4.

OCR Polygon Matching. Frames overlaid with OCR bounding boxes. Left: input image overlaid with the warped polygons from images rectified using a predicted transformation (center) and the ground-truth transformation (right). Purple areas denote a correct match; blue and red areas were detected only in the ground-truth and prediction rectified images, respectively. (Color figure online)

Geometric and Visual Metrics. In addition to an OCR-based evaluation, we use two metrics for evaluating the geometric correctness and visual similarity of our results, End Point Error (EPE) and Multi-Scale Structural Similarity (MS-SSIM). The EPE metric is used to evaluate the calculated rectification warps and compare them to ground truth. Following  [16], we include evaluation for this metric in our benchmark.

The MS-SSIM [24] metric quantifies how visually similar the output images are to the ground truth. Given that a small amount of shift is expected and is not considered an error, a naive evaluation using \(L_1\) or \(L_2\) metrics is not suited to our setting. Therefore, following [6], we use the MS-SSIM metric, which focuses on statistical measures rather than per-pixel color accuracy. Evaluating statistics rather than per-pixel accuracy also has its limitations, as character-level rectification is a fine-grained task and improvements on this scale are not always manifested in this metric. In fact, SSIM is much more sensitive to small visual deformations in documents containing large amounts of text or sharp edges. Thus, we only use it to complement our finer-grained, OCR-based metrics. For further discussion regarding the SSIM metric, see the supplementary material.

Table 1. Benchmark comparison using Tesseract OCR [22]. For \(E_d\) and EPE, lower is better, while for SSIM, higher is better.
Table 2. Benchmark comparison using a commercial OCR model [11]

4.2 Implementation Details

Models are trained by first optimizing the 3D estimation network using 3D coordinates, text masks, curvature masks and local angle supervision signals to convergence. Starting from the converged 3D estimation models, we fine-tune our model in an end-to-end manner by using a fixed, pre-trained, differentiable backward mapper. We calculate the \(L_1\) and angle losses over the output backward maps and back-propagate the losses to the 3D estimation network. Training is conducted using 15,000 high-resolution images rendered in Blender  [5] using over 8,000 texture images. Further details regarding data generation are provided in the supplementary material.

4.3 Comparison to DewarpNet [6]

The first result we present is a comparison to the prior state-of-the-art trained on our training set, using the Tesseract [22] engine.

We show mean and standard deviation values over 5 experiments in Table 1. Our method improves the edit distance metric over the previous method by \(4.5\%\) absolute and \(20.2\%\) relative. We also see improvements in EPE and SSIM metrics, and a reduction in standard deviation for all three. The use of both angle regression and curvature estimation improves performance and stabilizes the optimization process, reducing the sensitivity to model initialization.

Next, we evaluate our method using the public online API of [11]. Results are presented in Table 2. The commercial engine [11] is superior to [22], reducing the mean edit distance from 0.178 to 0.103, yet CREASE still maintains a significant gap over DewarpNet of \(0.6\%\) absolute and \(5.1\%\) relative.

4.4 Evaluation Using Real World Images

Figure 5 depicts a qualitative comparison between our rectification method and [6] on the real images provided by [13]. Notice how the text lines rectified using CREASE are better aligned and easier to read than the other method’s outputs, especially for text near document edges. Additional examples are included in the supplementary material.

Table 3. Angle loss evaluation for the 3D estimation model.
Table 4. Ablation study.
Fig. 5.

Visual Examples. Results on the real image dataset of [13]. Left to right: input images, rectification of the input image using the output of [6] trained on our data, and rectification of the input image using our model’s output transformation.

4.5 Angle Loss Evaluation

Table 3 shows the contribution of the different elements of our angle-based loss presented in Sect. 3.2 to our metrics, and to the OCR metric in particular. ‘Angles’ refers to models trained with the angle loss applied to all image pixels, instead of only those that contain text. ‘+ Mask’ refers to applying the text mask to the loss, i.e., computing the loss only over text-containing pixels, using the mask denoted by \(\mathbf {\hat{D}}\) in Eq. (3). ‘+ Conf.’ represents the use of the angle confidence values (denoted \(\rho \) in Eq. (1)); when not used, we set \(\rho \) to 1 for all pixels. We report results averaged over 5 experiments each, as well as the standard deviation. For this experiment, the curvature estimation term was omitted. Our contributions show a consistent improvement over the vanilla 3D estimation network and, in addition, a much more stable training framework with consistent results over multiple initializations.

4.6 Ablation Study

Table 4 shows the effect of each component of our method. Models trained using angle and curvature estimation are compared to vanilla models. We compare both models trained end-to-end (denoted E2E) and models trained separately. As seen before, the improvement in results is also accompanied by a decrease in standard deviation, especially for models trained using curvature estimation.

We evaluate the contribution of end-to-end training of our model using a fixed, differentiable backward mapper and losses derived from its results, i.e., the backward map and angle prediction errors (shown in Table 4). The top three rows refer to models that were not trained in an end-to-end fashion, while the three rows below (starting with ‘E2E’) refer to models trained end-to-end. ‘Angles’ and ‘Curvature’ denote the use of each of our two added auxiliary predictions.

The dual usage of the angle loss, in both the 3D estimation model and the end-to-end training, together with the curvature estimation, results in a much more readable rectification and a more stable training scheme than the previous state-of-the-art.

5 Conclusion

We presented CREASE, a content aware document rectification method which optimizes a per-pixel angle regression loss, a curvature estimation loss and a 3D coordinate estimation loss for providing image rectification maps.

Our method rectifies folded and creased documents using hints found in both local and global scale properties of the document, and provides a significant improvement in OCR performance, geometry and visual similarity based metrics. In our proposed two stage model, the first stage is used for predicting 3D structure, angles and curvature, while the second stage predicts the backward map. We utilize a pixel-level angle regression loss that is shown to be a beneficial side-task in both the 3D estimation and the end-to-end training. Furthermore, our 3D estimation model learns the angle side-task specifically on the words in the document, thus optimizing for readability in the rectified image, while the curvature estimation side-task complements the angle regression by mapping its discontinuities.

Extensive testing and comparisons show our method’s superior performance over diverse inputs, using both real and synthetic evaluation data. We show an increase in OCR performance, geometry and similarity metrics that is consistent over all experiments and on a variety of documents.