Classification of business documents for real-time application

Gaceb, Djamel; Eglin, Véronique; Lebourgeois, Frank

doi:10.1007/s11554-011-0227-4

Original Research Paper
Published: 30 November 2011

Journal of Real-Time Image Processing, volume 9, pages 329–345 (2014)

Djamel Gaceb¹,
Véronique Eglin¹ &
Frank Lebourgeois¹

Abstract

In this paper, we present a new document classification method based on physical layout features and graph b-coloring. To reduce computing time and improve the performance of our automatic reading system, we pre-classify business documents by introducing an Automatic Recognition of Documents stage as a pre-analysis phase. This phase guides the subsequent stages that recognize the document contents. Once the document type is identified, the reading system uses the information source associated with that type to improve recognition of the logical layout, the selection and parameterization of the OCR, and the final sorting decision. The graph coloring model is used for both layout analysis and document classification. The proposed method is reliable, robust to various constraints, and guarantees a real-time response for the sorting of business documents.
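
The abstract does not describe the coloring procedure itself, so the following is only a minimal sketch of the general idea of grouping documents by graph b-coloring. It assumes documents are compared through a pairwise dissimilarity matrix, links two documents when their dissimilarity exceeds a threshold, and colors the resulting graph so that each color class acts as a candidate document type. The matrix values, the threshold, and all function names are hypothetical and not taken from the paper. A plain greedy coloring stands in for the authors' algorithm, and the b-coloring property (every color has a vertex adjacent to all other colors) is checked afterwards rather than guaranteed.

```python
from itertools import combinations


def threshold_graph(dissim, threshold):
    """Link documents i and j when they are too dissimilar to share a class."""
    n = len(dissim)
    adj = {i: set() for i in range(n)}
    for i, j in combinations(range(n), 2):
        if dissim[i][j] > threshold:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def greedy_coloring(adj):
    """First-fit coloring, visiting vertices by decreasing degree."""
    colors = {}
    for v in sorted(adj, key=lambda v: -len(adj[v])):
        used = {colors[u] for u in adj[v] if u in colors}
        c = 0
        while c in used:
            c += 1
        colors[v] = c
    return colors


def dominating_vertices(adj, colors):
    """For each color, list the vertices adjacent to every other color."""
    all_colors = set(colors.values())
    dom = {c: [] for c in all_colors}
    for v, c in colors.items():
        neighbor_colors = {colors[u] for u in adj[v]}
        if all_colors - {c} <= neighbor_colors:
            dom[c].append(v)
    return dom


def is_b_coloring(adj, colors):
    """Proper coloring in which every color has a dominating vertex."""
    proper = all(colors[u] != colors[v] for u in adj for v in adj[u])
    return proper and all(dominating_vertices(adj, colors).values())


# Hypothetical dissimilarities between five documents:
# documents 0-2 have similar layouts, as do documents 3-4.
dissim = [
    [0.0, 0.1, 0.2, 0.9, 0.8],
    [0.1, 0.0, 0.1, 0.8, 0.9],
    [0.2, 0.1, 0.0, 0.9, 0.9],
    [0.9, 0.8, 0.9, 0.0, 0.1],
    [0.8, 0.9, 0.9, 0.1, 0.0],
]

adj = threshold_graph(dissim, threshold=0.5)
colors = greedy_coloring(adj)
print(colors, is_b_coloring(adj, colors))
```

In this toy example the two color classes coincide with the two layout groups. In a b-coloring, the dominating vertex of each class can serve as a representative of that class, which is one reason the model is attractive for clustering.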

Author information

Authors and Affiliations

LIRIS INSA de Lyon, 20, Av. Albert Einstein, 69621, Villeurbanne Cedex, France
Djamel Gaceb, Véronique Eglin & Frank Lebourgeois

Corresponding author

Correspondence to Djamel Gaceb.

About this article

Cite this article

Gaceb, D., Eglin, V. & Lebourgeois, F. Classification of business documents for real-time application. J Real-Time Image Proc 9, 329–345 (2014). https://doi.org/10.1007/s11554-011-0227-4

Received: 29 January 2010
Accepted: 03 October 2011
Published: 30 November 2011
Issue Date: June 2014
DOI: https://doi.org/10.1007/s11554-011-0227-4
