
Classification of business documents for real-time application

  • Original Research Paper
  • Journal of Real-Time Image Processing

Abstract

In this paper, we present a new document classification method based on physical layout features and graph b-coloring. To reduce computing time and improve the performance of our automatic reading system, we pre-classify business documents by introducing an Automatic Recognition of Documents stage as a pre-analysis phase. This phase guides the other stages involved in recognizing the document contents. Once the document type is identified, the reading system uses the corresponding information source to improve the recognition of the logical layout, the selection and parameterization of the OCR, and the final sorting decision. The graph coloring model is used for both layout analysis and document classification. The proposed method is reliable, robust to various constraints, and guarantees a real-time response when sorting business documents.
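The abstract does not spell out the features or the coloring procedure, so the following is only a minimal sketch of how a graph b-coloring can group documents by layout. It assumes each document is summarized by a small numeric layout vector (hypothetical values here), links two documents when their L1 distance exceeds a threshold, and treats each color class as one document type. A b-coloring is a proper coloring in which every color has at least one "dominating" vertex with neighbors in all other colors. The helper names (`threshold_graph`, `greedy_coloring`, `is_dominating`, `b_coloring`) are illustrative and are not taken from the paper.

```python
from itertools import combinations


def threshold_graph(features, threshold):
    """Link two documents when their layout vectors differ by more than `threshold`."""
    adj = {i: set() for i in range(len(features))}
    for i, j in combinations(range(len(features)), 2):
        dist = sum(abs(a - b) for a, b in zip(features[i], features[j]))  # L1 distance
        if dist > threshold:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def greedy_coloring(adj):
    """Proper coloring: each vertex takes the smallest color unused by its neighbors."""
    colors = {}
    for v in sorted(adj, key=lambda v: -len(adj[v])):  # highest degree first
        used = {colors[u] for u in adj[v] if u in colors}
        colors[v] = next(c for c in range(len(adj)) if c not in used)
    return colors


def is_dominating(v, adj, colors, palette):
    """True when v sees every color of the palette other than its own."""
    return palette - {colors[v]} <= {colors[u] for u in adj[v]}


def b_coloring(adj):
    """Turn a proper coloring into a b-coloring by dissolving weak color classes."""
    colors = greedy_coloring(adj)
    while True:
        palette = set(colors.values())
        weak = [c for c in sorted(palette)
                if not any(colors[v] == c and is_dominating(v, adj, colors, palette)
                           for v in adj)]
        if not weak:
            return colors
        c = weak[0]
        # No vertex of class c is dominating, so each one misses some other color
        # among its neighbors and can safely move to it; class c then disappears.
        for v in [v for v in adj if colors[v] == c]:
            free = palette - {c} - {colors[u] for u in adj[v]}
            colors[v] = min(free)


# Hypothetical layout vectors for six documents (two visually distinct layouts).
features = [(0.10, 0.20), (0.12, 0.22), (0.11, 0.19),
            (0.80, 0.90), (0.82, 0.88), (0.79, 0.91)]
clusters = b_coloring(threshold_graph(features, threshold=0.5))
print(clusters)  # maps document index -> color (cluster id)
```

Documents that share a color are never linked in the graph, so they are pairwise similar under the chosen threshold; the dominating vertex of each color can serve as a representative of that document type.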



Author information

Corresponding author

Correspondence to Djamel Gaceb.

About this article

Cite this article

Gaceb, D., Eglin, V. & Lebourgeois, F. Classification of business documents for real-time application. J Real-Time Image Proc 9, 329–345 (2014). https://doi.org/10.1007/s11554-011-0227-4

  • DOI: https://doi.org/10.1007/s11554-011-0227-4
