
1 Introduction

Automatic information extraction from business documents is still a challenging task due to the semi-structured nature of such documents [3]. While an instance of a specific document type contains a predefined set of document fields to be extracted (e.g. date, currency, or total amount), the positioning and representation of these fields are not constrained in any way. Documents issued by a certain company, however, usually have a specific layout.

Popular word classification approaches to information extraction, e.g. [7], require huge datasets of labeled images, which is not feasible for many real-life information extraction tasks.

Convolutional neural networks (CNN) have been used extensively for document segmentation and document classification and, more broadly, for detecting text in natural scenes (e.g. [2]). In the case of CNNs, nets are also trained on explicitly labeled datasets with information about the targets (e.g. pixel level labels, bounding boxes, etc.).

The latest deep neural network architectures can be trained directly end-to-end to extract relevant information. The method proposed in [8] takes the spatial structure into account by using convolutional operations on concatenated document text and image modalities, with the text of the document extracted using an Optical Character Recognition (OCR) engine.

Our main goal has been to develop a system capable of predicting the positions of document fields on documents with new layouts (or even on documents of new types) that were previously never seen by our system, with learning performed on a small number of documents labeled by the user. To achieve this goal, we propose a method that relies exclusively on the modality of document images, as the complex spatial structure of business documents is clearly reflected in their image modality. The proposed method takes into account the spatial structure of documents, using as a basis the Bag-of-Words (BoW) model [10], a very popular approach in computer vision.

The BoW model has been widely used for natural image retrieval tasks and is based on a variety of keypoint detectors/descriptors, SIFT and SURF features being especially popular. State-of-the-art key-region detectors and local descriptors have also been successfully used for document representation in document classification and retrieval scenarios [4], in logo spotting [9], and in document matching [1].

However, document images are distinctly different from natural scenes, as document images have an explicit structure and high contrast, resulting in the detection of numerous standard key regions. Classically detected keypoints do not carry any particular semantic or structural meaning for the documents. Methods specifically designed for document images make explicit use of document characteristics in their feature representations. In [12], it was proposed to use as keypoints the centers of connected components detected through blurring and subsequent thresholding, and a new affine invariant descriptor Locally Likely Arrangement Hashing (LLAH) that encodes the relative positions of key regions. In [5], key regions are detected by applying the MSER algorithm [6] to morphologically preprocessed document images.

Inspired by the results of [5, 12] and following the BoW approach, we generate a document-oriented codebook of visual words based on key regions detected by MSER and several types of compound local descriptors, containing both photometric and geometric information about the region. The visual codebook is then used to calculate statistical predicates for document field positions based on correlations between visual words and document fields.

2 Method

The main idea of our method is to build a codebook of visual words from a bank of documents (this is similar to the BoW approach) and apply the visual codebook to calculate statistical predicates for document field positions based on the spatial appearance of the visual words on the document. We use connected components extracted by the MSER algorithm from a set of morphologically preprocessed document images as our key regions. Next, local descriptors can be calculated in such key regions using various techniques. The codebook consists of the centers of clusters obtained for the local descriptors (such centers are also known as “visual words”). We use the mutual information (MI) of two random variables, the position of a document field and the position of a particular visual word, as a measure of quality for that visual word. The integrated quality of the visual codebook can be estimated as the average value of MI over all visual words. We predict target document field positions via conditional histograms collected at the fixed positions of the individual visual words. The integrated prediction of field position is calculated as a linear combination of the predictions from all the individual visual words detected on the document.

2.1 Key Region Extraction

To extract key regions from a document image, we apply an MSER detector after morphological preprocessing. More specifically, we combine all the MSER regions detected on the original document image and on its copies obtained by sequential application of an erosion operation. Examples of extracted rectangles of MSER regions of different sizes are shown in Fig. 1.
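A minimal sketch of this key-region extraction step, assuming OpenCV; the function name, the number of erosion iterations, and the kernel size are illustrative choices rather than the exact settings of our implementation.

```python
# Sketch: MSER key regions from the original image and its eroded copies.
import cv2
import numpy as np

def extract_key_regions(gray, erosion_steps=3, kernel_size=3):
    """Run MSER on the original grayscale image and on successively eroded
    copies, returning bounding rectangles (x, y, w, h) of all detected regions."""
    mser = cv2.MSER_create()
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    rects = []
    img = gray.copy()
    for _ in range(erosion_steps + 1):
        regions, _ = mser.detectRegions(img)
        for pts in regions:
            rects.append(cv2.boundingRect(pts.reshape(-1, 1, 2)))
        img = cv2.erode(img, kernel)  # next scale: an eroded copy of the document
    return rects
```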

Fig. 1. An original invoice image (a) and bounding rectangles of MSER regions of different sizes extracted from the image. The area of the extracted regions is less than 0.005 (b), 0.01 (c), and 0.05 (d) of the image area. The color of a region represents the size of the area (the smallest region is shown in red and the largest region is shown in blue). (Color figure online)

MSER regions are roughly equivalent to the connected components of a document image produced over all the possible thresholdings of the image. Such key regions correspond to the structural elements of the document (i.e. characters, words, lines, etc.). Combined with iterative erosion preprocessing, the MSER algorithm provides an efficient multi-scale analysis framework. It has been shown that MSER regions perform well in matching tasks of document analysis [11].

2.2 Calculation of Local Descriptors

Various local descriptors have been used in document image processing, both photometric (e.g. SIFT) and geometric (e.g. LLAH [12]). In our work, we considered the following photometric descriptors of extracted MSER regions: the popular SIFT and SURF descriptors and two descriptors composed of DFT or DWT coefficients (all calculated for a grayscale image). Additionally, we concatenate the photometric descriptor with a geometric descriptor. The latter can consist of several components, including the size of the region, its aspect ratio, etc. Before calculating a local descriptor, we build a bounding rectangle for each extracted MSER region. We then transform the corresponding rectangular region of the document image into a square region and calculate a local descriptor for each resulting square region.
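As an illustration, the sketch below assembles one possible compound descriptor from a DFT-based photometric part and a two-component geometric part; the patch size, the normalization, and the geometric weight are illustrative parameters (the optimized values reported in Subsect. 2.4 are a \(16 \times 16\) patch and a weight of 1/10).

```python
# Sketch: compound descriptor = DFT of a resampled square patch + weighted geometry.
import cv2
import numpy as np

def compound_descriptor(gray, rect, patch_size=16, geom_weight=0.1):
    x, y, w, h = rect
    patch = cv2.resize(gray[y:y + h, x:x + w], (patch_size, patch_size))
    # photometric part: magnitudes of the 2-D DFT of the square patch
    photometric = np.abs(np.fft.fft2(patch.astype(np.float32))).flatten()
    photometric /= (np.linalg.norm(photometric) + 1e-8)
    # geometric part: relative region area and aspect ratio, down-weighted
    img_area = float(gray.shape[0] * gray.shape[1])
    geometric = geom_weight * np.array([w * h / img_area, w / float(h)],
                                       dtype=np.float32)
    return np.concatenate([photometric, geometric])
```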

2.3 Building a Visual Codebook

To obtain the codebook, we use our private set of 6K invoice images. We extract 50K – 80K local descriptors from these invoice images. Vector quantization is then applied to split the descriptors into N clusters, which will serve as visual words for further image analysis. Quantization is carried out by K-means clustering, though other methods (K-medoids, histogram binning, etc.) are certainly possible. For each cluster, we calculate the standard deviation of its local descriptors from the codebook images. Next, we normalize the distance between the descriptor and the center of the cluster by the standard deviation, so that the Euclidean distance may be used later on when detecting visual words. It should be noted that in the described scenario, the dataset of 6K unlabeled invoices is used only once at the development stage to generate a high quality visual codebook. In Subsect. 3.2, we describe our experiments with receipts, demonstrating that the obtained codebook can be used for processing different types of documents.
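A minimal sketch of the codebook construction with scikit-learn K-means follows; treating the per-cluster standard deviation as the standard deviation of member-to-center distances is our reading of the normalization step, and the parameter values are illustrative.

```python
# Sketch: vector quantization of local descriptors into visual words.
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(descriptors, n_words=600, seed=0):
    """descriptors: (n_samples, dim) array collected from the unlabeled invoices."""
    kmeans = KMeans(n_clusters=n_words, random_state=seed, n_init=10)
    labels = kmeans.fit_predict(descriptors)
    centers = kmeans.cluster_centers_
    # per-cluster standard deviation of member-to-center distances, used later
    # to normalize distances so that a plain Euclidean comparison can be applied
    stds = np.ones(n_words)
    for k in range(n_words):
        d = np.linalg.norm(descriptors[labels == k] - centers[k], axis=1)
        if len(d) > 1 and d.std() > 0:
            stds[k] = d.std()
    return centers, stds
```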

2.4 Assessing and Optimizing the Codebook

To assess the quality of the obtained codebook, we use another private dataset of 1K invoice images, which are different from the images that were used to create the codebook. In this dataset, the important fields (e.g. “Invoice Date” or “Total”) are explicitly labeled. From each document in this second dataset, we extract all the key regions and their corresponding local descriptors. Each extracted local descriptor is then vector-quantized using the nearest visual word in the codebook (i.e. the nearest of the cluster centers obtained when creating the codebook). We will refer to this procedure as “visual word detection.”
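A sketch of this visual word detection step, under the assumption that the normalization divides the Euclidean distance to each center by that cluster's standard deviation:

```python
# Sketch: assign a descriptor to the nearest visual word using normalized distances.
import numpy as np

def detect_visual_word(descriptor, centers, stds):
    dists = np.linalg.norm(centers - descriptor, axis=1) / stds
    return int(np.argmin(dists))  # index of the nearest visual word in the codebook
```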

Thus we detect all the available visual words in our second dataset of invoices. Next, we calculate a two-dimensional histogram \(h(W_i,W_j)\) of coordinates \((W_i,W_j)\) for a particular visual word W.

We can also calculate a two-dimensional histogram \(h(F_i,F_j)\) of coordinates \((F_i,F_j)\) for a particular labeled field F.

Finally, we can calculate the following conditional histograms:

  • conditional histogram \(h(F_i,F_j|W_k,W_l)\) of the position for the field F under the fixed position \((W_k,W_l)\) for the visual word W,

  • conditional histogram \(h(W_i,W_j|F_k,F_l)\) of the position for the word W under the fixed position \((F_k,F_l)\) for the invoice field F.

Bin values of the two-dimensional histograms are calculated for the cells of a spatial grid of \(M \times N\) elements. We set \(M=N=16\) for invoice images.
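The sketch below shows how such position histograms and conditional histograms can be accumulated on the \(M \times N\) grid, assuming each field or word occurrence is reduced to the grid cell containing the center of its bounding rectangle (the cell-assignment rule described at the end of Subsect. 2.5); the data layout is illustrative.

```python
# Sketch: 2-D position histograms on an M x N grid and conditional histograms h(F | W).
import numpy as np

M = N = 16  # grid used for invoice images

def cell_of(cx, cy, width, height, m=M, n=N):
    """Grid cell (row, col) that contains the point (cx, cy) on the image."""
    return min(int(cy / height * m), m - 1), min(int(cx / width * n), n - 1)

def position_histogram(cells, m=M, n=N):
    """h(W_i, W_j) or h(F_i, F_j) from a list of observed (row, col) cells."""
    h = np.zeros((m, n))
    for r, c in cells:
        h[r, c] += 1
    return h

def conditional_histograms(field_cells, word_cells, m=M, n=N):
    """h(F | W) for one field F and one visual word W: field_cells[d] is the cell
    of F on document d, word_cells[d] is the list of cells where W was detected
    on document d. Returns an (m, n, m, n) array indexed [word cell][field cell]."""
    h = np.zeros((m, n, m, n))
    for (fr, fc), w_cells in zip(field_cells, word_cells):
        for wr, wc in w_cells:
            h[wr, wc, fr, fc] += 1
    return h
```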

If we have all of the above histograms, we can calculate the mutual information \( MI (W,F)\) of two random variables, the position of the document field F and the position of the visual word W, as

$$\begin{aligned} MI (W,F) = H(F) - H(F|W) = H(W) - H(W|F), \end{aligned}$$
(1)

where H(F), H(W) are the marginal entropies of random positions F and W, calculated using the histograms \(h(F_i,F_j)\) and \(h(W_i,W_j)\);

\(H(F|W)\) is the conditional entropy of F given that the value of W is known, calculated using the conditional histogram \(h(F_i,F_j|W_k,W_l)\) and subsequent averaging of the result over all possible positions \((W_k,W_l)\).

A similar approach is used for \(H(W|F)\).
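A sketch of this mutual information estimate, computed from the joint histogram of word and field cells built as above; here the averaging over word positions is weighted by how often each word position occurs, which makes the identity \( MI (W,F) = H(F) - H(F|W)\) exact for the empirical distribution.

```python
# Sketch: MI(W, F) = H(F) - H(F | W) from the joint histogram of cell counts.
import numpy as np

def entropy(counts, eps=1e-12):
    p = counts / (counts.sum() + eps)
    return -np.sum(p * np.log2(p + eps))  # zero bins contribute ~0

def mutual_information(joint):
    """joint[wr, wc, fr, fc]: counts of field cell (fr, fc) observed with word
    cell (wr, wc); assumes at least one observation."""
    m, n = joint.shape[:2]
    total = joint.sum()
    h_f = entropy(joint.sum(axis=(0, 1)))        # marginal entropy H(F)
    h_f_given_w = 0.0
    for wr in range(m):
        for wc in range(n):
            counts = joint[wr, wc]
            weight = counts.sum() / total
            if weight > 0:
                h_f_given_w += weight * entropy(counts)  # average over word positions
    return h_f - h_f_given_w
```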

The mutual information \( MI (W,F)\) of the two random variables, the position of the document field F and the position of the word W, is a measure of the mutual dependence between the two variables. Hence, if we average \( MI \) over all the visual words in the codebook, we may use the result as an integrated quality measure of the codebook for a particular document field F (e.g. the “Total” field in the case of invoices).

We determined that the best values of \( MI \) corresponded to the following values of the main codebook parameters:

  • the photometric local descriptor is composed using DFT coefficients;

  • for DFT calculation, the bounding rectangle of an extracted MSER region on a grayscale image is transformed into a square area of \(16 \times 16\) pixels;

  • the geometric descriptor consists of only two components – the size of the MSER region and its aspect ratio;

  • both descriptors are concatenated, as components, into a compound local descriptor, with the weight of the geometric descriptor equal to 1/10;

  • the size of the codebook \(N=600\).

Again, it should be noted that this optimization procedure, using a large number of labeled invoices, is performed only once at the development stage.

2.5 Calculating Statistical Predicates

Once we have a visual codebook built on 6K unlabeled invoice images and optimized on 1K labeled invoice images, we can calculate a statistical predicate \(P(F_j)\) for the position of a field \(F_j\) on any invoice document.

As in the previous section, for each visual word from the codebook we can calculate a conditional histogram \(h(F_i,F_j|W_k,W_l)\) of the position of the particular field F under the fixed position \((W_k,W_l)\) of the visual word W over the labeled dataset. However, when calculating this histogram, we use as spatial coordinates the shift S of the position of the field F relative to the fixed position of the word W.

Then we can calculate the integral two-dimensional histogram \(h(S(F,W))\) of the shift S of the position of the field F, which incorporates the shifts relative to all the possible positions of the visual word W in the labeled dataset. The set of N shift histograms \(h(S(F,W_j))\) for all the visual words \(W_j\) from the codebook, together with the codebook itself, constitutes the complete data sufficient to calculate statistical predicates of invoice field positions in our method.
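A sketch of how a shift histogram \(h(S(F,W_j))\) can be accumulated for one visual word; the \((2M-1) \times (2N-1)\) bin layout is an illustrative choice that covers every possible cell-to-cell shift.

```python
# Sketch: histogram of shifts of the field cell relative to the word cell.
import numpy as np

def shift_histogram(field_cells, word_cells, m=16, n=16):
    """field_cells[d]: cell (row, col) of field F on labeled document d;
    word_cells[d]: cells where visual word W_j was detected on document d."""
    h = np.zeros((2 * m - 1, 2 * n - 1))
    for (fr, fc), w_cells in zip(field_cells, word_cells):
        for wr, wc in w_cells:
            h[fr - wr + (m - 1), fc - wc + (n - 1)] += 1
    return h
```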

If we are presented with a completely new document from which fields must be extracted, we first detect all the visual words. Then, for each instance of a codebook visual word \(W_k\), we calculate the predicate \(P_{ik}(F)\) of the possible position of the field F using the appropriate shift histogram \(h(S(F,W_k))\), stored together with the codebook. The integral predicate \(P_k(F)\) of the possible position of the field F based on all the instances of the visual word \(W_k\) is calculated as the sum of the individual predicates \(P_{ik}(F)\) for all the instances of the visual word \(W_k\) in the document.

Note that for an instance of the visual word \(W_k\) on the document, a portion of the shift histogram \(h(S(F,W_k))\) may not contribute to the calculation of the predicate for this visual word, because large shifts may produce field position estimates that lie outside the document image.

The integral predicate P(F) of the possible position of the field F based on the appearance of all the visual words on the document may be calculated as a linear combination of the individual predicates \(P_k(F)\) from the various visual words \(W_k\) detected in the document. Figure 2 demonstrates the statistical predicates of the “Total” field on an invoice image. Note that individual predicates based on individual visual words may poorly predict the position of a field, but an integral predicate, calculated over all the instances of all the visual words detected on a document, performs sufficiently well (typically, we detected 70–120 visual words per invoice).
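The prediction step can be sketched as follows; the weights are a free parameter (uniform weights give the simple sum used in most of our experiments, while squared mutual information values give the weighting evaluated in Subsect. 3.2), and the portions of the shift histograms that would place the field outside the image are simply never read.

```python
# Sketch: integral predicate P(F) as a weighted sum of per-instance shift votes.
import numpy as np

def field_predicate(word_instances, shift_hists, weights, m=16, n=16):
    """word_instances: list of (word_index, (row, col)) detected on the document.
    shift_hists[k]: (2m-1) x (2n-1) shift histogram for visual word k.
    weights[k]: contribution weight of word k (1.0 for a plain sum).
    Returns an m x n array: the integral predicate of the field position."""
    predicate = np.zeros((m, n))
    for k, (wr, wc) in word_instances:
        hist = shift_hists[k]
        for fr in range(m):
            for fc in range(n):
                predicate[fr, fc] += weights[k] * hist[fr - wr + (m - 1),
                                                       fc - wc + (n - 1)]
    return predicate
```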

Fig. 2. From left: the original image, the integral predicate for the “Total” field, an individual predicate for the “Total” field based on an instance of an individual visual word with index 411 from our codebook of 600 words. The “Total” field is marked by a blue rectangle. The instance of the individual visual word is marked by a green rectangle. The color palette shows the colors used for different predicate values (from 0 at the bottom to the maximum value at the top of the palette). The size of the grid is \(16 \times 16\) elements. (Color figure online)

It can be seen that our statistical predicate \(P(i,j|F)\) is a two-dimensional array of the probabilities of the document field F appearing in different cells \((i,j)\) of the spatial grid of \(M \times N\) cells imposed on the image. When calculating histograms using the dataset of labeled invoice images, we assume that a particular cell contains the field F (or the word W) if the center of the field’s (or word’s) rectangle is located inside the cell.

The prediction of the position of a field F may be determined by the positions of the elements of the predicate array with the top n values. We refer to the grid cells containing the n maximum values of the predicate \(P(i,j|F)\) as “top-n cells.” In our experiments, we used the following metrics to measure the accuracy of the proposed method (a short computational sketch follows the list):

  • top-1 accuracy, which is the percentage of correct predictions based on the grid cell with the top value of the statistical predicate \(P(i,j|F)\);

  • top-3 accuracy, which is the percentage of correct predictions based on the grid cells with the top 3 values of the statistical predicate \(P(i,j|F)\);

  • top-5 accuracy, which is the percentage of correct predictions based on the grid cells with the top 5 values of the statistical predicate \(P(i,j|F)\).
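A minimal sketch of the top-n check behind these metrics (tie-breaking among equal predicate values is an implementation detail):

```python
# Sketch: a prediction is correct if the true field cell is among the n
# grid cells with the largest predicate values.
import numpy as np

def top_n_correct(predicate, true_cell, n=1):
    flat = predicate.ravel()
    top = np.argpartition(flat, -n)[-n:]          # indices of the n largest bins
    rows, cols = np.unravel_index(top, predicate.shape)
    return any((r, c) == true_cell for r, c in zip(rows, cols))
```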

3 Experiments

In all our experiments, we used the visual codebook created using a dataset of 6K unlabeled invoice images with approximately 72K local descriptors. This is a random subset of our private database of 60K invoices, which includes a variety of documents from different countries and vendors. The local descriptor for the codebook was composed using the DFT brightness coefficients and the size and aspect ratio of grayscale image regions detected by the MSER algorithm. The codebook contained 600 visual words.

3.1 Invoices

For invoices, we used our private dataset described above and focused on the detection of three fields: “Total,” “Currency,” and “Invoice Date.” We describe the detection of only three fields here, but the proposed method imposes no limits on the number of extracted fields.

To calculate two-dimensional histograms, we apply a grid of \(16 \times 16\) cells. A subset of 34 images sharing the same layout and originating from the same vendor was arbitrarily split into 15 training images and 19 test images. Examples of images from this subset are shown in Fig. 3.

Fig. 3. Examples of invoice images from the experimental subset. All the invoices have the same layout and were issued by the same vendor.

For accuracy measurements in all our experiments, we apply cross-validation. In this experiment, the integrated predicate was calculated as a simple sum of the predicates of individual visual words. The results are shown in Table 1. Note that the accuracy differs for the three fields; this is because their positions fluctuated within different ranges even within the same layout. Note also that we obtained a top-10 accuracy of 0.918 (averaged over the 3 fields) using only 5 labeled images for training. In other words, we measured the accuracy of predicting that the field appears within an area of \(\sim \)3.8% of the entire image. This information will prove valuable when the proposed method is used as an assisting tool in approaches based on text recognition. Due to certain technical limitations, Table 1 does not contain detailed data for this accuracy assessment.

Table 1. Accuracy for invoices sharing a single layout.

3.2 Receipts

A public dataset of labeled receipts from the ICDAR 2019 Robust Reading Challenge on Scanned Receipts OCR and Information Extraction was used in our experiments. The dataset is available for download at https://rrc.cvc.uab.es/?ch=13.

We focused on three fields: “Total,” “Company,” and “Date.” To calculate two-dimensional histograms, we used a grid of \(26 \times 10\) cells. This grid size was chosen to reflect the geometry of the average receipt.

Experiment 1. For the first experiment, we chose 360 images from the dataset where receipts occupied the entire image. This was done by simply filtering the images by their dimensions. The resulting subset contained a lot of different layouts from different companies.

Images were arbitrarily split into 300 training images and 60 testing images. We calculated the statistical predicates of the field positions on the training images and measured the accuracy of our predicates on the testing images. In this experiment, the integrated predicate was calculated as a simple sum of the predicates of the individual visual words. The results are shown in Table 2.

Table 2. Accuracy for a mixture of receipts with varying layouts. 300 labeled images were used for training. The prediction is the sum of the individual predicates.
Table 3. Accuracy for a mixture of receipts with varying layouts. 300 labeled images were used for training. The prediction is a linear combination of the individual predicates, with weights equal to the squared mutual information values of the individual visual words.

Table 3 shows the results of the same experiment, but with the integrated predicate for the field F calculated as a linear combination of the predicates from the individual visual words, with weights equal to the squared mutual information values \(MI(F,W_k)\) of the individual visual words \(W_k\); see (1). The average accuracy here is about 1% higher than in Table 2, where we used a simple sum to calculate the integrated predicate. From this we conclude that the calculation of integrated predictions can be further optimized and may become a subject of future research.

Experiment 2. For the second experiment, we first chose a subset of 34 receipts sharing the same layout and originating from the same company (Subset A). Examples of images from Subset A are shown in Fig. 4. Subset A was arbitrarily split into 10 training images and 24 test images. In this experiment, the integrated predicate was calculated as a simple sum of the predicates of the individual visual words. The results obtained on Subset A are shown in Table 4. Note that we obtained a top-10 accuracy of 0.769 using only 5 labeled images for training.

Fig. 4. Examples of receipts from Subset A. Notice the large variations in the field positions within the same receipt layout.

Fig. 5. Examples of receipts from Subset B. Notice the moderate variations in the field positions within the same receipt layout.

For this experiment, we had a limited number of receipts sharing the same layout. We assume, however, that we can achieve greater accuracy by simply using more training images.

Table 4. Accuracy for receipts sharing the same layout. Receipts in Subset A display large variations in layout.

Next, we chose another subset, this time containing 18 receipts (Subset B). Subset B was arbitrarily split into 5 images for training and 13 images for testing. Layout variations in Subset B were much smaller than in subset A. Examples of images from Subset B are shown in Fig. 5. The results obtained on Subset B are shown in Table 5.

Table 5. Accuracy for receipts sharing the same layout. Receipts in Subset B display moderate variations in layout.

The experiments with receipts demonstrate that the average top-3 accuracy was greater for documents sharing the same layout than for documents with different layouts. Moreover, for receipts sharing the same layout, training on only 10 labeled documents was sufficient, while it took 300 labeled documents to achieve similar levels of accuracy for receipts with different layouts.

Note that the number of test images in our experiments was relatively small (13 to 24), as our datasets contained only a limited number of images sharing the same layout.

4 Conclusion and Future Work

In this paper, we presented a system of document field extraction based on a visual codebook. The proposed system is intended for a processing scenario where only a small number of labeled documents is available to the user for training purposes. Our experiments with a publicly available dataset of receipts demonstrated that the system performs reasonably well on documents sharing the same layout and displaying moderate variations in the field positions.

We achieved the following values of average top-5 accuracy with only 5 labeled receipt images used for training:

- 0.639 for experimental Subset A with large layout variations;

- 0.969 for experimental Subset B with moderate layout variations.

The experiments on receipts were conducted using a visual codebook built on a dataset of 6K unlabeled documents of a different type (invoices).

At the same time, an average top-5 accuracy of 0.828 was achieved for invoice images sharing the same layout with only 5 labeled documents used for training. The top-10 accuracy was 0.918. Top-10 accuracy in our experiments is the accuracy with which the system can predict that the center of a field will occur in an area constituting \(\sim \)3.8% of the entire image.

We may conclude that the proposed method can be used on its own when the user can or is willing to label only a few training documents, which is a frequent situation in many real-life tasks. The proposed method may also be used as an assisting technique in approaches based on text recognition or to facilitate the training of neural networks.

Future research may involve adapting the proposed method to processing ID documents. Preliminary experiments with ID documents have shown that a visual codebook built on grayscale invoice images can perform reasonably well on some ID fields. Better performance may be achieved if an invoice-based codebook is enriched with important visual words corresponding to elements of colored and textured backgrounds of ID documents.

Correlations between performance and the parameters of our codebook should be further investigated.

Another possible approach to improving the performance of the proposed system is to use modern automatic optimization algorithms, especially differential evolution. This should allow us to determine more accurate values for two dozen parameters. Additional local descriptors can also be easily added.

The system’s accuracy may also be improved by optimizing the calculation of integrated predictions from individual visual words contributions.