Abstract
This paper considers a broad class of structured documents that cannot be described by one or several models that associate document fields with geometric elements. A formal model of such documents, based on the concept of a multiset, is described. Examples of structured documents of this class are given, and a technique for constructing models of structured documents is proposed. The technique is illustrated by an implementation of an automated document management system. The implemented algorithms for detecting document fields are described, and implementation problems are discussed.
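As a rough illustration of what a multiset-based description might look like, the sketch below treats a document as a multiset of detected field labels and a model as bounds on how often each label may occur. This is only an assumption made for illustration: the field names, the bounds, and the helper functions `as_multiset` and `conforms` are hypothetical and are not taken from the paper, whose formal model may differ substantially.

```python
from collections import Counter

# Hypothetical model: for each field type, the allowed (min, max) number
# of occurrences. None as the maximum means "no upper bound".
# Field names and bounds are illustrative assumptions, not from the paper.
MODEL = {
    "title": (1, 1),
    "applicant": (1, 3),
    "claim": (1, None),
    "drawing": (0, None),
}


def as_multiset(detected_fields):
    """Collapse a sequence of detected field labels into a multiset
    (a mapping from label to multiplicity)."""
    return Counter(detected_fields)


def conforms(fields, model):
    """Return True if the multiset contains no unknown field type and
    every multiplicity lies within the bounds given by the model."""
    if any(name not in model for name in fields):
        return False
    for name, (low, high) in model.items():
        count = fields[name]  # Counter yields 0 for absent labels
        if count < low or (high is not None and count > high):
            return False
    return True


doc = as_multiset(["title", "applicant", "claim", "claim", "claim"])
print(conforms(doc, MODEL))
```

Because the check depends only on how many times each field occurs, not on where it sits on the page, a description of this kind can accommodate documents whose number of fields varies from one instance to the next.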
Translated from Programmirovanie, Vol. 31, No. 4, 2005.
Original Russian Text Copyright © 2005 by Slavin.
Slavin, O.A. Recognition Algorithms for Structured Documents with Variable Content. Program Comput Soft 31, 211–223 (2005). https://doi.org/10.1007/s11086-005-0033-5