Multi-modal page stream segmentation with convolutional neural networks

  • Original Paper
  • Language Resources and Evaluation

Abstract

In recent years, (retro-)digitizing paper-based files has become a major undertaking for private and public archives as well as an important task in electronic mailroom applications. The workflow usually starts with batch scanning and optical character recognition (OCR) of the documents. For multi-page documents, preserving the document context is a major requirement. To facilitate workflows involving very large amounts of paper scans, page stream segmentation (PSS) is the task of automatically separating a stream of scanned images into coherent multi-page documents. In a digitization project together with a German federal archive, we developed a novel approach to PSS based on convolutional neural networks (CNN). For the first time, we combine visual information from scanned images with semantic information from OCR-ed texts for this task. The multi-modal combination of features in a single classification architecture allows for major improvements towards optimal document separation. Beyond multi-modality, our PSS approach benefits from transfer learning and sequential page modeling. We achieve accuracies of up to 95% on multi-page documents in our in-house dataset and up to 93% on a publicly available dataset.
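
To give a concrete impression of the multi-modal approach, the following is a minimal sketch of a fusion architecture in Keras: a pre-trained VGG16 image branch is combined with a text CNN branch over OCR tokens to decide whether a page starts a new document. All layer sizes, the vocabulary size, and the sequence length are illustrative assumptions and do not reproduce the exact architecture of the paper; see the repository linked in note 15 for the original code.

```python
# Minimal sketch of a multi-modal PSS classifier: a VGG16 image branch is
# fused with a text CNN branch to predict whether a page starts a new
# document. All sizes below are illustrative assumptions, not the values
# used in the paper.
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import VGG16

# Image branch: pre-trained VGG16 as a frozen feature extractor (transfer learning).
image_input = layers.Input(shape=(224, 224, 3), name="page_image")
vgg = VGG16(weights="imagenet", include_top=False, pooling="avg")
vgg.trainable = False
image_features = vgg(image_input)

# Text branch: word embeddings followed by a 1D convolution over OCR tokens.
text_input = layers.Input(shape=(150,), dtype="int32", name="ocr_tokens")
embedded = layers.Embedding(input_dim=20000, output_dim=100)(text_input)
conv = layers.Conv1D(filters=128, kernel_size=5, activation="relu")(embedded)
text_features = layers.GlobalMaxPooling1D()(conv)

# Fusion: concatenate both modalities and classify SD (same document) vs. ND (new document).
merged = layers.concatenate([image_features, text_features])
hidden = layers.Dropout(0.5)(layers.Dense(256, activation="relu")(merged))
output = layers.Dense(1, activation="sigmoid", name="new_document")(hidden)

model = Model(inputs=[image_input, text_input], outputs=output)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```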

Notes

  1. The task is also referred to as Document Flow Segmentation, Document Stream Segmentation, or Document Separation.

  2. Ideally, we would compare our system directly against other approaches from the scientific literature. Unfortunately, we encountered neither a ready-to-use implementation of any text-based PSS system nor a common, public-domain dataset for the task. Our archive dataset is so heterogeneous (see Sect. 1) that we refrained from manually engineering descriptor features. Instead, we opted for a strong baseline: a machine learning-based system that does not require extensive feature engineering and has been used successfully in related work (see Table 1).

  3. In previous experiments, we showed that higher image resolutions lead to better results in PSS, but processing speed decreases drastically. We choose the final image size as the largest input image size for a pre-trained VGG16 network from our transfer learning setup (see Sect. 4); a minimal preprocessing sketch is given after these notes.

  4. They utilize the Tobacco 3482 dataset consisting of pages manually tagged with 10 different document categories (Kumar et al. 2014). The dataset is widely used in DIC research. Since it does not contain multi-page documents, it is not suitable for PSS.

  5. We use the Liblinear library by Fan et al. (2008); a sketch of a comparable linear SVM baseline is given after these notes.

  6. We refrain from using image features in this architecture because raw pixel features cannot be expected to be linearly separable. Initial experiments confirmed that simple pixel features do not contribute discriminative information on top of text features to the linear SVM for our task. We could, of course, use a different SVM kernel for image classification, but we would very likely lose the advantage of computational speed. We therefore stick to text features for our baseline method.

  7. We utilize inverse probability weighting on the training data to put more weight on the minority class (new document) during loss calculation; see the weighting sketch after these notes.

  8. In fact, there is a large variety of unsupervised topic models, as well as many other methods to reduce sparse, high-dimensional text data to a dense, lower-dimensional space (e.g. latent semantic analysis, Deerwester et al. 1990). For our baseline system, we stick to LDA as the seminal and most widely used topic model; a minimal feature-extraction sketch is given after these notes.

  9. The number of hidden layer units and the dropout rate were obtained by hyperparameter tuning on the validation set.

  10. We also tested ensemble stacking with a logistic regression classifier but did not achieve better performance.

  11. As for the MLP in early fusion, the optimal weighting values for i, j, and k were obtained via optimization on the validation set; see the late-fusion sketch after these notes.

  12. The performance of neural network classification, in general, is not entirely deterministic due to random initialization of layer weights and shuffling of mini-batches during training. To allow for a fair comparison of different CNN architectures, we repeated each experiment 10 times and report average results in Table 3.

  13. Landis and Koch (1977) specify Cohen’s kappa values above 0.4 as moderate agreement, and values above 0.6 as substantial agreement.

  14. Early experiments showed a slightly worse overall performance of our architecture when applying class-specific loss weights, although we have a rather high class imbalance between SD and ND pages.

  15. All our experiments regarding multi-modal PSS with CNN architectures can be found in this GitHub repository: https://github.com/uhh-lt/pss-lrev.
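
The following is a minimal sketch of the image preprocessing referred to in note 3, assuming the Pillow library and the common 224 × 224 input resolution of a pre-trained VGG16; the resolution actually used in the paper may differ.

```python
# Sketch: load a scanned page, resize it to a VGG16-compatible resolution, and
# apply the standard VGG16 preprocessing. The 224x224 size is an illustrative
# assumption, not necessarily the resolution chosen in the paper.
import numpy as np
from PIL import Image
from tensorflow.keras.applications.vgg16 import preprocess_input

def load_page(path, size=(224, 224)):
    """Return a preprocessed image array for a single scanned page."""
    img = Image.open(path).convert("RGB").resize(size, Image.LANCZOS)
    return preprocess_input(np.asarray(img, dtype=np.float32))
```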
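
As a companion to notes 2, 5 and 6, this is a sketch of a text-only linear SVM baseline. scikit-learn's LinearSVC is backed by the same Liblinear library; the tf-idf features and parameter values are illustrative assumptions rather than the exact setup used in the paper.

```python
# Sketch of a text-only linear SVM baseline for PSS, trained on OCR text per
# page with label 1 = new document (ND) and 0 = same document (SD).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

baseline = make_pipeline(
    TfidfVectorizer(max_features=50000),
    LinearSVC(class_weight="balanced"),  # counter the SD/ND class imbalance
)
# baseline.fit(page_texts_train, labels_train)
# predictions = baseline.predict(page_texts_test)
```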
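
A sketch of the inverse probability weighting mentioned in note 7, written as Keras-style per-class loss weights; the exact weighting scheme used in the paper may differ in detail.

```python
# Sketch: weight each class inversely proportional to its frequency in the
# training data so that the rare "new document" class contributes more to the
# loss than the frequent "same document" class.
import numpy as np

def inverse_probability_weights(labels):
    """Map each class label to a weight inversely proportional to its frequency."""
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    weights = len(labels) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

# Example: model.fit(..., class_weight=inverse_probability_weights(train_labels))
```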
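
A sketch of deriving dense LDA topic features from OCR-ed page texts for the baseline described in note 8, using scikit-learn; the number of topics and the vectorizer settings are illustrative assumptions.

```python
# Sketch: reduce sparse bag-of-words page representations to dense topic
# proportions with LDA, which then serve as input features for the SVM baseline.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_features=20000)
lda = LatentDirichletAllocation(n_components=100, random_state=42)

# bow = vectorizer.fit_transform(page_texts)       # sparse term counts per page
# topic_features = lda.fit_transform(bow)          # dense 100-dim vector per page
```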
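
A sketch of weighted late fusion as referenced in note 11: per-page probabilities from separate classifiers are mixed with weights i, j and k tuned on the validation set. Which classifier outputs the three weights apply to, and the example weight values, are assumptions for illustration only.

```python
# Sketch: combine per-page "new document" probabilities from three classifiers
# with fixed weights and threshold the weighted average at 0.5.
import numpy as np

def late_fusion(p_image, p_text, p_combined, i, j, k):
    """Weighted average of per-page probabilities from three classifiers."""
    weighted = i * np.asarray(p_image) + j * np.asarray(p_text) + k * np.asarray(p_combined)
    return weighted / (i + j + k)

# Example: predictions = late_fusion(p_img, p_txt, p_cmb, i=0.4, j=0.4, k=0.2) > 0.5
```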

References

  • Agin, O., Ulas, C., Ahat, M., & Bekar, C. (2015). An approach to the segmentation of multi-page document flow using binary classification. In Proceedings of the 6th International Conference on Graphic and Image Processing. https://doi.org/10.1117/12.2178778.

  • Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022. URL http://www.cs.princeton.edu/~blei/papers/BleiNgJordan2003.pdf.

  • Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5, 135–146.

  • Cho, K., Van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (pp. 1724–1734). Association for Computational Linguistics. https://doi.org/10.3115/v1/D14-1179. URL http://aclweb.org/anthology/D14-1179.

  • Daher, H., & Belaïd, A. (2014). Document flow segmentation for business applications. In Proceedings of Document Recognition and Retrieval XXI (pp. 9201–9215). San Francisco, USA. URL https://hal.archives-ouvertes.fr/hal-00926615.

  • Daher, H., Bouguelia, M. R., Belaïd, A., & d'Andecy, V. P. (2014). Multipage administrative document stream segmentation. In Proceedings of the 22nd International Conference on Pattern Recognition (pp. 966–971). https://doi.org/10.1109/ICPR.2014.176.

  • Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 391–407.

  • Fan, R. E., Chang, K. W., Hsieh, C. J., Wang, X. R., & Lin, C. J. (2008). LIBLINEAR: a library for large linear classification. Journal of Machine Learning Research, 9, 1871–1874. URL http://jmlr.org/papers/volume9/fan08a/fan08a.pdf.

  • Gallo, I., Noce, L., Zamberletti, A., & Calefati, A. (2016). Deep neural networks for page stream segmentation and classification. In Proceedings of the 2016 International Conference on Digital Image Computing: Techniques and Applications (pp. 1–7). https://doi.org/10.1109/DICTA.2016.7797031.

  • Gordo, A., Rusiñol, M., Karatzas, D., & Bagdanov, A. D. (2013). Document classification and page stream segmentation for digital mailroom applications. In Proceedings of the 12th International Conference on Document Analysis and Recognition (pp. 621–625). https://doi.org/10.1109/ICDAR.2013.128.

  • Hamdi, A., Voerman, J., Coustaty, M., Joseph, A., d'Andecy, V. P., & Ogier, J. M. (2017). Machine learning vs deterministic rule-based system for document stream segmentation. In Proceedings of the 14th IAPR International Conference on Document Analysis and Recognition (ICDAR) (pp. 77–82). https://doi.org/10.1109/ICDAR.2017.332.

  • Hamdi, A., Coustaty, M., Joseph, A., d'Andecy, V. P., Doucet, A., & Ogier, J. M. (2018). Feature selection for document flow segmentation. In Proceedings of the 13th IAPR International Workshop on Document Analysis Systems (pp. 245–250). https://doi.org/10.1109/DAS.2018.66.

  • Harley, A. W., Ufkes, A., & Derpanis, K. G. (2015). Evaluation of deep convolutional nets for document image classification and retrieval. In Proceedings of the 13th International Conference on Document Analysis and Recognition (ICDAR) (pp. 991–995). https://doi.org/10.1109/ICDAR.2015.7333910.

  • Isemann, D., Niekler, A., Preßler, B., Viereck, F., & Heyer, G. (2014). OCR of legacy documents as a building block in industrial disaster prevention. In Proceedings of the DIMPLE@LREC Workshop on Disaster Management and Principled Large-scale information Extraction for and post emergency logistics.

  • Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. In Proceedings of the 10th European Conference on Machine Learning (pp. 137–142). Berlin: Springer. ISBN 978-3-540-69781-7.

  • Karpinski, R., & Belaïd, A. (2016). Combination of structural and factual descriptors for document stream segmentation. In Proceedings of the 12th IAPR Workshop on Document Analysis Systems (pp. 221–226). https://doi.org/10.1109/DAS.2016.21.

  • Kim, Y. (2014). Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (pp. 1746–1751). Association for Computational Linguistics, https://doi.org/10.3115/v1/D14-1181. URL http://aclweb.org/anthology/D14-1181.

  • Kumar, J., Ye, P., & Doermann, D. (2014). Structural similarity for document image classification and retrieval. Pattern Recognition Letters, 43, 119–126.

  • Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159–174. ISSN 0006341X, 15410420. URL http://www.jstor.org/stable/2529310.

  • Le, Q., & Mikolov, T. (2014). Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on International Conference on Machine Learning, (pp. 1188–1196).

  • Lewis, D., Agam, G., Argamon, S., Frieder, O., Grossman, D., & Heard, J. (2006). Building a test collection for complex document information processing. In Proceedings of the 29th Annual International ACM SIGIR Conference (pp. 665–666).

  • Meilender, T., & Belaïd, A. (2009). Segmentation of continuous document flow by a modified backward-forward algorithm. In SPIE Electronic Imaging, Los Angeles, USA. URL https://hal.inria.fr/inria-00347217.

  • Niekler, A., & Jähnichen, P. (2012). Matching results of latent dirichlet allocation for text. In Proceedings of the 11th International Conference on Cognitive Modeling (pp. 317–322). Universitätsverlag der TU Berlin.

  • Noce, L., Gallo, I., Zamberletti, A., & Calefati, A. (2016). Embedded textual content for document image classification with convolutional neural networks. In Proceedings of the 2016 ACM Symposium on Document Engineering (pp. 165–173). ACM: New York. ISBN 978-1-4503-4438-8. https://doi.org/10.1145/2960811.2960814.

  • Otsu, N. (1979). A threshold selection method from gray-level histograms. IEEE Transactions on Systems, Man, and Cybernetics, 9(1), 62–66. https://doi.org/10.1109/tsmc.1979.4310076.

  • Phan, X. H., Nguyen, C. T., Le, D. T., Nguyen, L. M., & Horiguchi, S. (2011). A hidden topic-based framework toward building applications with short web documents. IEEE Transactions on Knowledge and Data Engineering, 23(7), 961–976. https://doi.org/10.1109/TKDE.2010.27.

  • Rusiñol, M., Frinken, V., Karatzas, D., Bagdanov, A. D., & Lladós, J. (2014). Multimodal page classification in administrative document image streams. International Journal on Document Analysis and Recognition, 17(4), 331–341. https://doi.org/10.1007/s10032-014-0225-8.

  • Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. CoRR, arXiv:1409.1556, URL http://arxiv.org/abs/1409.1556.

  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems 30 (pp. 5998–6008). Curran Associates. URL http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf.

  • Wiedemann, G. (2019). Proportional classification revisited: Automatic content analysis of political manifestos using active learning. Social Science Computer Review, 37(2), 135–159. https://doi.org/10.1177/0894439318758389.

  • Wiedemann, G., Ruppert, E., Jindal, R., & Biemann, C. (2018). Transfer learning from LDA to BiLSTM-CNN for offensive language detection in Twitter. In Proceedings of GermEval Task 2018, 14th Conference on Natural Language Processing (Konvens) (pp. 85–94). Vienna, Austria: Austrian Academy of Sciences.

Acknowledgements

This work was carried out at the University of Leipzig (Germany) within the joint research project “Knowledge Management of Legacy Documents in Science, Administration and Industry”, together with the Helmholtz Research Centre for Environmental Health in Munich and CID GmbH, Freigericht. The authors thank their colleagues at Helmholtz and CID for their valuable support.

Author information

Correspondence to Gregor Wiedemann.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Cite this article

Wiedemann, G., Heyer, G. Multi-modal page stream segmentation with convolutional neural networks. Lang Resources & Evaluation 55, 127–150 (2021). https://doi.org/10.1007/s10579-019-09476-2
