Recognition Algorithms for Structured Documents with Variable Content

Programming and Computer Software

Abstract

The paper deals with a wide class of structured documents that cannot be described using one or several models based on associations between the document fields and geometric elements. A formal model of such documents is described that is based on the concept of a multiset. Examples of structured documents of this class are given and a technique for the construction of models of structured documents is proposed. This technique is illustrated using an implementation of an automated document management system. Implemented algorithms for detecting document fields are described, and implementation problems are discussed.
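
The formal model itself is not visible in this preview, so the following is only a minimal sketch of the general idea of a multiset, not the paper's definition. It assumes a document can be summarized as a multiset of field types, where each field type may occur a variable number of times. The field names (`header`, `line_item`, `total`, `footer`) and the helper `contains` are hypothetical and chosen purely for illustration.

```python
from collections import Counter

# Hypothetical field names; the paper's actual model is not shown in the preview.
# A Counter behaves as a multiset: each key maps to its multiplicity.
first_page = Counter({"header": 1, "line_item": 3, "total": 1})
continuation_page = Counter({"line_item": 5, "footer": 1})


def contains(doc: Counter, required: Counter) -> bool:
    """Return True if `required` is a sub-multiset of `doc`,
    i.e. every required field occurs at least as often in `doc`."""
    return all(doc[field] >= count for field, count in required.items())


# Multiset sum: multiplicities of equal field types are added.
document = first_page + continuation_page

# A hypothetical minimal template: one header, at least one line item, one total.
required = Counter({"header": 1, "line_item": 1, "total": 1})

print(document["line_item"])
print(contains(document, required))
print(contains(continuation_page, required))
```

Unlike a fixed geometric template, a description of this kind constrains how many times each field type may appear rather than where it appears, which is one way to read the abstract's notion of variable content.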

Translated from Programmirovanie, Vol. 31, No. 4, 2005.

Original Russian Text Copyright © 2005 by Slavin.

Slavin, O.A. Recognition Algorithms for Structured Documents with Variable Content. Program Comput Soft 31, 211–223 (2005). https://doi.org/10.1007/s11086-005-0033-5

  • DOI: https://doi.org/10.1007/s11086-005-0033-5
