
1 Introduction

A document analysis system is an important component in many business applications because it reduces human effort in the extraction and classification of information present in documents.

While many applications use Optical Character Recognition (OCR) systems to extract text from document images and operate directly on it, documents often have an implicit visual structure. Helpful contextual information is given by the position of text on a page and, more generally, by the page layout. Reports containing tables and figures, invoices, resumes, and forms are difficult to process without considering the relationship between layout and textual content.

While there are efforts to deal with the visual structure of documents by leveraging text [31, 47], datasets of relevant size are mostly internal, and privacy concerns prevent their public release. Moreover, labeling such datasets is an expensive and time-consuming process.

This is not the case for natural images. Natural images are prevalent on the internet, and large-scale annotated datasets are publicly available. The ImageNet database [9] contains 14M annotated natural images with 1000 classes and has powered many advances in computer vision and image understanding through the training of high-capacity convolutional neural networks (CNNs). ImageNet also gives neural networks the ability to transfer to other tasks such as object detection and semantic segmentation [21]. A neural network pre-trained on ImageNet shows substantial performance gains compared to a network trained from scratch [49].

Furthermore, it was shown that pre-training neural networks with large amounts of noisily labeled images [46] substantially improves performance after fine-tuning on the main classification task. This is indicative of a need to make use of large corpora of partially labeled or unlabeled data. Moreover, modern methods of leveraging unlabeled data have been developed [11, 33]: a pretext task is created, on which the network is trained with self-supervision, after which the network is fine-tuned on the main task.

Unlike natural image datasets, document datasets are hard to come by, especially fully annotated ones, and have only a fraction of the scale of ImageNet [15, 29]. However, unlabeled documents are easily found online in the form of e-books and scientific papers.

Qualitatively, document images are very different from natural images, and therefore using a CNN pre-trained on ImageNet for fine-tuning on documents is questionable. Document images are also structurally different from natural images, as they are not invariant to scaling and flips. It has been shown that models trained on ImageNet often generalize poorly to fine-grained classification tasks on classes that are poorly represented in ImageNet [27]. While there are classes that are marginally similar to document images (e.g., menus, websites, envelopes), they are vastly outnumbered by other natural images. Moreover, models pre-trained on the RVL-CDIP [15] dataset perform much better on document classification tasks with a limited amount of data [26].

Self-supervision methods designed for document images have received little attention. As such, there is a clear need for learning more robust representations of documents, which make use of large, unlabeled document image datasets.

This paper makes the following contributions to the field of document understanding:

  1. We make a quantitative analysis of self-supervised methods for pre-training convolutional neural networks on document images.

  2. We show that patch-based pre-training is sub-optimal for document images. To that end, we propose improved versions of some of the most popular methods for self-supervision that are better suited for learning structure from documents.

  3. We propose an additional self-supervision method which exploits the inherent multi-modality (text and visual layout) of documents, and show that the representations it provides are superior to those from pre-training on ImageNet, and consequently better than those of all other self-supervised methods we have tested, in the context of document image classification on Tobacco-3482 [28].

  4. We make a qualitative analysis of the filters learned through our multi-modal pre-training method and show that they are similar to those learned through direct supervision, which makes our method a viable option for pre-training neural networks on document images.

2 Related Work

2.1 Transfer Learning

One of the requirements of statistical modeling is that training and test examples must be independent and identically distributed (i.i.d.). Transfer learning relaxes this assumption [43]. In computer vision, most applications employ transfer learning by fine-tuning a model trained on the ImageNet dataset [9]. Empirically, ImageNet models transfer well to other downstream tasks [21, 27], even with little data for fine-tuning. However, they provide only marginal performance gains for tasks whose labels are not well represented in the ImageNet dataset.

State-of-the-art results on related tasks such as object-detection [39] and instance segmentation [16] are improved with the full ImageNet dataset used as pre-training, but data-efficient learning still remains a challenge.

2.2 Self-supervision

Unsupervised learning methods for pre-training neural networks have sparked great interest in recent years. Given the large quantity of data available on the internet and the cost of rigorously annotating it, several methods have been proposed to learn general features. Most modern methods pre-train models to predict pseudo-labels on pretext tasks and then fine-tune them on a supervised downstream task, usually with smaller amounts of data [22].

With its roots in natural language processing, one of the most successful approaches is the skip-gram method [32], which provides general semantic textual representations by predicting the preceding and succeeding tokens from a single input token. More recent developments in natural language processing show promising results with models such as BERT [38] and GPT-2 [10], which are pre-trained on very large text corpora with token-prediction objectives.

Similarly, this approach has been explored for images, with works trying to generate representations by “context prediction” [11]. The authors use a pretext task of classifying the relative position of patches in an image. The same principle is used in works which explore solving jigsaw puzzles as a pretext task [7, 33]. In both cases, the intuition is that good performance on the patch classification task is directly correlated with good performance on the downstream task, and with the network learning semantic information from the image.

Other self-supervision methods include predicting image rotations [14], image colorization [48] and even a multi-task model with several tasks at once [12]. Furthermore, exemplar networks [13] are trained to discriminate between a set of surrogate classes in order to learn transformation-invariant features. A more recent advancement in this area is Contrastive Predictive Coding [17], one of the best-performing methods for self-supervised pre-training.

An interesting multi-modal technique for self-supervision leverages a corpus of images from Wikipedia and their descriptions [3]. The authors pre-train a network to predict the topic probabilities of the text description of an image, thereby leveraging the language context in which images appear.

Clustering techniques have also been explored [6, 20, 34], by generating a classification pretext task with pseudo-labels based on cluster assignments. One method for unsupervised pre-training that makes very few assumptions about the input space is proposed by Bojanowski et al. [5]. This approach trains a network to align its output features with a fixed set of target representations randomly sampled from the unit sphere.

Interestingly, a study by Kolesnikov et al. [25] demonstrated that the performance of self-supervision methods is inconsistent across network architectures: some architectures are better suited to encode image rotation, while others are better suited to handle patch information. We argue that this inconsistency also holds across datasets. These techniques show promising results on natural images, but very little research is devoted to learning good representations for document images, which have entirely different structural and semantic properties.

2.3 Document Analysis

The representation of document images is of practical interest in commercial applications for tasks such as classification, retrieval, clustering, attribute extraction and historical document processing [36]. Shallow features for representing documents [8] have proven less effective than deep features learned by a convolutional neural network. Several medium-scale datasets containing labeled document images are available; the ones used in this work are RVL-CDIP [15] and Tobacco-3482 [28]. For classification problems on document images, state-of-the-art approaches leverage domain knowledge of documents [1], combining features from the header, the footer and the contents of an image. Layout-based methods are used in other works [2, 24, 47] to combine textual information with its visual position in the image in order to extract semantic structure.

A study by Kölsch et al. [26] showed that pre-training networks on the RVL-CDIP dataset is better than pre-training on ImageNet for a supervised classification problem on the Tobacco-3482 dataset. Still, training from scratch is far worse than training with ImageNet pre-training [41].

3 Methods

For our experiments, we implemented several methods for self-supervision and evaluated their performance on the Tobacco-3482 document image classification task, where there is a limited amount of data. We implemented two Context-Free Networks (CFN): relative patch classification [11] and solving jigsaw puzzles [33], which are patch-based and, by design, do not use the broader context of the document. We also trained a model to predict image rotations [14], as a method that could intuitively make use of the layout, and an input-agnostic method developed by Bojanowski et al. [5], which forces the model to learn mappings from the input image to noise vectors that are progressively aligned with deep features. We propose variations to context-free solving of jigsaw puzzles and to rotation prediction, which improve performance: Jigsaw Whole, a pretext task that solves jigsaw puzzles but with the whole image given as input, and predicting flips, which is in the same spirit as predicting rotations but better suited for document images.

We also developed a method that makes use of the information-rich textual modality: the model is tasked to predict the topic probabilities of the text present in the document using only the image as an input. This method is superior to ImageNet pre-training.

3.1 Implementation Details

Following the recommendations of the extensive survey in [44], we used the document images in grayscale format, resized to a fixed size of 384 \(\times \) 384. Images are scaled so that the pixel values fall in the interval \((-0.5, 0.5)\), by dividing by 255 and subtracting 0.5 [23].
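For concreteness, a minimal sketch of this preprocessing in Python (using PIL and NumPy; the function name is ours and only illustrative):

```python
import numpy as np
from PIL import Image

def preprocess_document(path, size=384):
    """Load a document image as grayscale, resize it to size x size and
    scale pixel values to the interval (-0.5, 0.5)."""
    img = Image.open(path).convert("L").resize((size, size), Image.BILINEAR)
    x = np.asarray(img, dtype=np.float32) / 255.0 - 0.5
    return x[..., np.newaxis]  # shape (size, size, 1)
```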

Shear transformations or crops are usually used to improve the performance and robustness of CNNs on document images [44]. We intentionally do not use augmentations during training or evaluation, in order to speed up the process and lower the complexity of the experiments.

The InceptionV3 [42] architecture was used in all our experiments because of its popularity, performance and availability in common deep learning frameworks.

3.2 Jigsaw Puzzles

In the original paper for pre-training with solving Jigsaw puzzles [33], the authors propose a Context-Free Network architecture, with nine inputs, each being a crop from the original image. There, the pretext task is to reassemble the crops into the original image by predicting the permutation.

This is sensible for natural images, which are invariant to zooms: objects appear at different scales in images, and random crops could contain information that is useful when fine-tuning on the main task.

On the other hand, document images are more rigid in their structure. A random crop could end up falling in an area of blank space or in the middle of a paragraph. Such crops contain no information about the layout of the document, and their relationship is not clear when they are processed independently. Moreover, the text size changes relative to the crop size: when fine-tuning, the text is significantly smaller relative to the input size, which is inconsistent with the pretext task.

Fig. 1. Input image for our Jigsaw Whole method. The scrambled image is given as is to a single network, without using siamese branches, such that context and layout information is preserved.

We propose a new way of pre-training by solving jigsaw puzzles with convolutional networks, keeping the layout of the document visible to the model. After splitting the image into nine crops and shuffling them, we reassemble them into a single puzzle image. An example of the model input is shown in Fig. 1. We name this variation Jigsaw Whole. The intuition is that the convolutional network will learn better semantic features by leveraging the context of the document as a whole.

To obtain the final resolution, we resized the initial image to 384 \(\times \) 384 pixels and then split it into nine crops of 128 \(\times \) 128 pixels each. Using a jitter of 10 pixels (as recommended by Noroozi et al. [33]) results in a resolution of 118 \(\times \) 118 pixels for each of the nine patches.
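A minimal NumPy sketch of how a Jigsaw Whole input can be assembled under these settings (function and argument names are illustrative, not the exact training code):

```python
import numpy as np

def jigsaw_whole_input(img, permutation, grid=3, cell=128, crop=118, rng=None):
    """Split a 384x384 image into a 3x3 grid, take a jittered 118x118 crop from
    each cell, shuffle the crops by `permutation` and reassemble a single image."""
    rng = rng if rng is not None else np.random.default_rng()
    crops = []
    for r in range(grid):
        for c in range(grid):
            dy, dx = rng.integers(0, cell - crop + 1, size=2)  # up to 10 px jitter
            cell_img = img[r * cell:(r + 1) * cell, c * cell:(c + 1) * cell]
            crops.append(cell_img[dy:dy + crop, dx:dx + crop])
    crops = [crops[i] for i in permutation]
    rows = [np.concatenate(crops[r * grid:(r + 1) * grid], axis=1) for r in range(grid)]
    return np.concatenate(rows, axis=0)  # a single 354x354 puzzle image
```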

As described by Noroozi et al. [33], we chose only 100 out of the 9! = 362880 possible permutations of the nine crops. These were selected using a greedy algorithm that approximates the maximal average Hamming distance between pairs of permutations in the set.
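A possible implementation of this greedy selection, keeping a running sum of Hamming distances to the permutations chosen so far (a sketch; names and the seed are illustrative):

```python
import itertools
import numpy as np

def select_permutations(n_perms=100, n_tiles=9, seed=0):
    """Greedily pick permutations whose average Hamming distance to the ones
    already selected is (approximately) maximal."""
    rng = np.random.default_rng(seed)
    all_perms = np.array(list(itertools.permutations(range(n_tiles))))  # (362880, 9)
    first = int(rng.integers(len(all_perms)))
    chosen = [first]
    # running sum of Hamming distances to all chosen permutations
    dist_sum = (all_perms != all_perms[first]).sum(axis=1).astype(np.float64)
    dist_sum[first] = -np.inf                      # never re-select a permutation
    for _ in range(n_perms - 1):
        idx = int(dist_sum.argmax())
        chosen.append(idx)
        dist_sum += (all_perms != all_perms[idx]).sum(axis=1)
        dist_sum[idx] = -np.inf
    return all_perms[chosen]
```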

3.3 Relative Position of Patches

Using a similar 3 \(\times \) 3 grid as in the previous method, and based on Doersch et al. [11], we implemented a siamese network to predict the position of a patch relative to the patch in the center. The model has two inputs (the crop in the middle and one of the crops around it); after the Global Average Pooling layer of InceptionV3, we added a fully-connected layer with 512 neurons activated with a rectified linear unit, followed by a final fully-connected layer with eight neurons and softmax activation. For fine-tuning, we kept the representations created after the Global Average Pooling layer and ignored the added fully-connected layers. To train the siamese network, we resized all images to 384 \(\times \) 384 pixels, created a grid of nine squares of 128 \(\times \) 128 pixels each and, using a jitter of 10 pixels, obtained input crops of 118 \(\times \) 118 pixels.
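A Keras sketch of such a two-branch model; combining the two branch features by concatenation before the added fully-connected layers is an assumption of this sketch, not necessarily the exact architecture used:

```python
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import InceptionV3

def relative_patch_model(crop_size=118):
    """Shared InceptionV3 trunk (global average pooling) applied to the centre
    patch and one neighbouring patch; an added 512-unit ReLU layer and an
    8-way softmax predict the neighbour's position relative to the centre."""
    trunk = InceptionV3(include_top=False, weights=None, pooling="avg",
                        input_shape=(crop_size, crop_size, 1))
    center = layers.Input((crop_size, crop_size, 1))
    neighbour = layers.Input((crop_size, crop_size, 1))
    features = layers.Concatenate()([trunk(center), trunk(neighbour)])
    x = layers.Dense(512, activation="relu")(features)
    out = layers.Dense(8, activation="softmax")(x)
    return Model([center, neighbour], out)
```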

Note that jitter is used both in solving jigsaw puzzles and in predicting the relative position of patches, in order to prevent the network from solving the problem trivially by analysing only the margins of the crops (in which case it would not need to learn any other structural or semantic features).

Similar to solving jigsaw puzzles, predicting the relative position of patches suffers from the same problem of having too little context in a patch. We show that these methods perform poorly on document image classification.

3.4 Rotations and Flips

A recent method for self-supervision proposed by Gidaris et al. [14] is the simple task of predicting image rotations. This task works quite well for natural images, since objects have an implicit orientation, and determining the rotation requires semantic information about the object. Documents, on the other hand, have only one natural orientation: upright. We pre-train our network to discriminate between four orientations (0\(^{\circ }\), 90\(^{\circ }\), 180\(^{\circ }\) and 270\(^{\circ }\)). It is evident that discriminating between the (0\(^{\circ }\), 180\(^{\circ }\)) and (90\(^{\circ }\), 270\(^{\circ }\)) pairs is trivial, as the text lines are positioned differently. We argue that this is a shortcut for the model, and in this case the task is not useful for learning semantic or layout information.

Instead, we propose a new method in the same spirit, creating a pretext task that requires the model to discriminate between different flips of the document image. This way, the more challenging scenarios from the rotation method are kept (text lines are always horizontal), and we argue that this forces the model to learn layout information or more fine-grained text features in order to discriminate between flips. It is worth noting that this method does not work for natural images, as they are invariant to flips, at least across the vertical axis. In our experiments, we named this variation Flips.
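As an illustration, generating a training example for this pretext task amounts to sampling one of four flips per image (a minimal NumPy sketch; names and the label encoding are ours):

```python
import numpy as np

FLIP_CLASSES = ["none", "horizontal", "vertical", "both"]

def make_flip_example(img, rng=None):
    """Apply one of four flips to a document image and return (image, label)
    for the 4-way flip-classification pretext task."""
    rng = rng if rng is not None else np.random.default_rng()
    label = int(rng.integers(4))
    if label in (1, 3):
        img = img[:, ::-1]   # horizontal flip (mirror across the vertical axis)
    if label in (2, 3):
        img = img[::-1, :]   # vertical flip (mirror across the horizontal axis)
    return img, label
```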

3.5 Multi-modal Self-supervised Pre-training

While image-only computer vision methods are used with some degree of success, many applications do require textual information to be extracted from the documents. Be it the semantic structure of documents [47], extracting attributes from financial documents [24] or table understanding [19, 31, 37], the text modality present in documents is a rich source of information that can be leveraged to obtain better document representations. We assume that the visual document structure is correlated with the textual information present in the document. Audebert et al. [2] use textual information to jointly classify documents from the RVL-CDIP dataset with significant results. Instead of jointly classifying, we explore self-supervised representation learning using the text modality.

The text is extracted by an OCR engine [40], which makes the resulting text very noisy: many words have low document frequency due to OCR mistakes. While this should not be a problem given the large amount of data in the RVL-CDIP dataset, we do clean the text by lower-casing it, replacing all numbers with a single token and discarding any non-alphanumeric characters.
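A minimal sketch of this cleaning step (the placeholder token and the exact regular expression are illustrative choices, not necessarily the ones used in our pipeline):

```python
import re

def clean_ocr_text(text, num_token="xnumx"):
    """Lower-case OCR output, replace every number with a single placeholder
    token and drop tokens containing non-alphanumeric characters."""
    tokens = text.lower().split()
    tokens = [num_token if re.fullmatch(r"[\d.,]+", t) else t for t in tokens]
    return " ".join(t for t in tokens if t.isalnum())
```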

Text Topic Spaces. Using the textual modality to self-train a neural network was proposed by Gomez et al. [3], who exploit the semantic context present in illustrated Wikipedia articles. The authors use the topic probabilities of the text as soft labels for the images in the article. Our approach is similar: we extract text from the RVL-CDIP dataset and analyse it using Latent Dirichlet Allocation [4] to extract topics. The CNN is then trained to predict the topic distribution given only the image of the document. In contrast to the setting of Gomez et al. [3], there is a more intimate and direct correspondence between the document layout and its text content.

Fig. 2. General methodology for multi-modal self-supervision. The text from the documents is extracted using an OCR engine, and a topic model is then used to generate topic probabilities. The neural network is tasked to predict these topic probabilities using only the document image.

In this topic modeling method, we used soft labels, as they were shown to improve performance in the context of knowledge distillation [18]. Soft labels are also robust against noise [45] and have been shown to increase performance for large-scale semi-supervised learning [46]. Figure 2 depicts the general overview of this method. Our intuition is that documents that are similar in the topic space should also be similar in appearance. In our experiments, we named this self-supervision method LDA Topic Spaces.
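A sketch of the soft-label generation using scikit-learn's LDA implementation (vectorizer settings, vocabulary size and names are illustrative assumptions):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def topic_soft_labels(texts, n_topics=64):
    """Fit LDA on the cleaned OCR text and return per-document topic
    probabilities, used as soft labels for the image model."""
    counts = CountVectorizer(max_features=50000).fit_transform(texts)
    lda = LatentDirichletAllocation(n_components=n_topics, learning_method="online")
    return lda.fit_transform(counts)  # shape (n_docs, n_topics), rows sum to 1

# The CNN (e.g. InceptionV3 with an n_topics-way softmax head) is then trained
# to predict these distributions from the document image alone, using
# cross-entropy against the soft labels.
```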

4 Experiments

For the pre-training phase of the self-supervised methods, we used the training set from RVL-CDIP [15]. RVL-CDIP is a dataset consisting of 400,000 grayscale document images, of which 320,000 are provided for training, 40,000 for validation, and the remaining 40,000 for testing. The images are labeled into 16 classes, some of which are also present in Tobacco-3482. Naturally, during our self-supervised pre-training experiments we discard the labels. During the evaluation, we used the pre-trained models as feature extractors and computed feature vectors for each image. We then trained a logistic classifier using L-BFGS [30]. As shown by Kolesnikov et al. [25], a linear model is sufficient for evaluating the quality of features. Images used in the extraction phase come from the Tobacco-3482 dataset and are pre-processed exactly as during training. For partitioning the dataset, we used the same method as in [1, 15, 23, 26] for consistency and for a fair comparison with other works. We used Top-1 Accuracy as the metric, and we trained on a total of 10 to 100 images per class (in increments of ten images), randomly sampled from Tobacco-3482. Testing was done on the remaining images. We ran each experiment 10 times to reduce the odds of having a favourable configuration of training samples. Our evaluation scheme is designed to test performance in a document image classification setting with a limited amount of data.
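A sketch of this linear evaluation protocol, assuming a Keras-style feature extractor (names and hyperparameters are illustrative):

```python
from sklearn.linear_model import LogisticRegression

def linear_probe(feature_extractor, x_train, y_train, x_test, y_test):
    """Evaluate frozen features with a logistic classifier trained by L-BFGS
    and report Top-1 accuracy on the held-out images."""
    f_train = feature_extractor.predict(x_train)  # e.g. global-average-pooled CNN features
    f_test = feature_extractor.predict(x_test)
    clf = LogisticRegression(solver="lbfgs", max_iter=1000)
    clf.fit(f_train, y_train)
    return clf.score(f_test, y_test)              # Top-1 accuracy
```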

In the particular case of LDA, we varied the number of topics and trained three models tasked to predict the topic probabilities of 16, 32, and 64 topics. By using this form of soft-clustering in the topic space, the model benefited from having a finer-grained topic distribution.

In our experiments, we used two supervised benchmarks: a model pre-trained on ImageNet and a model pre-trained on RVL-CDIP. The supervised pre-training methods have an obvious advantage, due to the high amount of consistent and correct information present in annotations. Consistent with other works [26], supervised RVL-CDIP pre-training is far superior.

Fig. 3. Fine-tuning performance of some of the more relevant methods for different sample sizes from Tobacco-3482. Our proposed methods achieve higher accuracy than prior methods designed for natural images. Multi-modal self-supervision using LDA (LDA Topic Spaces) with 64 topics performs significantly better than supervised ImageNet pre-training and much better than the other self-supervised methods we tested.

5 Results

In Fig. 3, we show some of the more relevant methods in the evaluation scheme. Supervised RVL-CDIP is, unsurprisingly, the best-performing method, and our self-supervised multi-modal approach has significantly higher accuracy overall when compared to supervised ImageNet pre-training. Features extracted from patch-based methods and from methods which rely only on layout information are not discriminative enough to reach higher accuracy. This is also consistent with the original works [5, 11, 33], in which self-supervised pre-training did not provide a boost in performance over the supervised baseline.

Relative Patches and Jigsaw Puzzles achieve only modest performance. Both of these methods originally used a “context-free” approach. Our variation of Jigsaw Puzzles, Jigsaw Whole, works around this by including more context: receiving the entire document helps the network learn features that are relevant to the layout. Features learned this way are more discriminative for the classes in Tobacco-3482. In the case of Relative Patches, there is no sensible way to include more context in the input, as stitching together two patches changes the aspect ratio of the input.

In Table 1, we present the mean accuracy for 100 samples per class for all methods. We also implemented the work of Bojanowski et al. [5] to pre-train a model by predicting noise. This method is very general and assumes very little about the inputs. We found that it performed better than predicting rotations. The features extracted by predicting rotations were, in fact, the weakest, as this task is far too easy for a model in the case of document images. Our variation, predicting flips, provides a much harder task, which translates into better performance on the classification task (Table 1).

Table 1. Results of our pre-training experiments. We also implemented Noise as Targets (NAT) [5] as an input-agnostic method for self-supervision. For LDA, we tried multiple numbers of topics and settled on 64, beyond which we experienced diminishing returns. All presented methods employ self-supervised pre-training unless otherwise specified.

In the case of topic modeling, we argue that the boost in performance is due to the high correlation between similarity in the topic space and similarity in the “image” space. The 16 classes in RVL-CDIP are sufficiently dissimilar from one another in terms of topics (e.g., news article, resume, advertisement), and each class of documents has a specific layout. Surprisingly, LDA with 16 topics was the weakest. A finer-grained topic distribution helped the model learn more discriminative features.

Fig. 4. Gradient ascent visualization of filters learned through supervised RVL-CDIP pre-training. Patterns for text lines, columns and paragraphs appear.

5.1 Qualitative Analysis

For the qualitative analysis, we compare filters learned through LDA self-supervision with those learned through RVL-CDIP supervised pre-training. In Figs. 4 and 5, we show gradient ascent visualizations of filters learned by both methods: a randomly initialized input image is progressively modified such that the mean activation of a given filter is maximized. The filters shown are from increasing depths in the InceptionV3 network, from conv2d_10 to conv2d_60.
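A minimal TensorFlow sketch of this gradient ascent procedure (the step size, number of steps and initialization range are illustrative):

```python
import tensorflow as tf

def visualise_filter(model, layer_name, filter_index, size=384, steps=100, lr=10.0):
    """Gradient ascent on a random input image so that the mean activation of
    one filter in the given layer is maximized."""
    layer = model.get_layer(layer_name)
    feature_model = tf.keras.Model(model.inputs, layer.output)
    img = tf.Variable(tf.random.uniform((1, size, size, 1), -0.1, 0.1))
    for _ in range(steps):
        with tf.GradientTape() as tape:
            loss = tf.reduce_mean(feature_model(img)[..., filter_index])
        grads = tape.gradient(loss, img)
        img.assign_add(lr * grads / (tf.norm(grads) + 1e-8))  # normalized ascent step
    return img.numpy()[0, ..., 0]
```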

In Fig. 4, there are clear emerging patterns that correspond to text lines, columns and paragraphs: general features that apply to a large subset of documents.

Fig. 5. Gradient ascent visualization of filters learned by LDA pre-training. Activation patterns that correspond to words and paragraphs emerge. More distinctively, patterns that appear to resemble words are more frequent than in the supervised setting.

In Fig. 5, we show filters learned by LDA self-supervision. In contrast to the “gold-standard” filters learned by RVL-CDIP supervision, the patterns that emerge here more frequently resemble words. This is a direct consequence of the way LDA constructs topics, in which a topic is based on a bag-of-words model. Our neural network therefore has a high response in regions corresponding to particular words in the image. The features learned this way are nonetheless discriminative for document images, as patterns for paragraphs and text columns still emerge. Another particularity of these filters is that they are noisier than those learned by direct supervision. This likely results from the soft and noisy labels generated by the topic model and from the imperfect text extracted by the OCR engine. Naturally, features learned from ImageNet pre-training are qualitatively different from, and more general than, filters specialized in extracting information from document images. See Olah et al. [35] for a comprehensive visualization of InceptionV3 trained on ImageNet.

6 Conclusions

We have explored self-supervision methods that were previously introduced in the realm of natural images and showed that document images have a more rigid visual structure, which makes patch-based methods less effective on them. To that end, we proposed slight alterations that exploit the visual properties of documents: self-supervised pre-training by predicting flips and by solving jigsaw puzzles with the whole layout present in the input.

Documents are inherently multi-modal. As such, by extracting text from the document images, we developed a method to pre-train a network, in a self-supervised manner, to predict topics generated by Latent Dirichlet Allocation. This method outperforms the strong baseline of supervised pre-training on ImageNet. We also show that the features learned this way are closely related to those learned through direct supervision on RVL-CDIP, making it a viable option for pre-training neural networks on document images.