Abstract
We consider the problem of generating grammars for classes of structured documents (dictionaries, encyclopedias, user manuals, and so on) from examples. The examples consist of the structures of individual documents, and they can be collected either by converting the typographical tagging of documents prepared for printing into structural tags, or by using document recognition techniques. Our method first forms finite-state automata that describe the examples completely. These automata are then modified by considering certain context conditions; the modifications correspond to generalizing the underlying language. Finally, the automata are converted into regular expressions, which are used to construct the grammar. In addition to automata, an alternative representation, characteristic k-grams, is introduced. We also describe some interactive operations that are necessary for generating a grammar for a large and complicated document.
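The abstract names characteristic k-grams as a representation but does not define them here. As a rough illustration only, the sketch below assumes a simplified reading: the k-grams are all length-k windows over each example's sequence of child element names, padded with boundary markers, and a new sequence is accepted when every one of its k-grams was seen in some example. The function names, the boundary markers, and the dictionary-entry element names are hypothetical and are not taken from the paper, whose context conditions may differ.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical boundary markers (an assumption, not from the paper).
START, END = "<s>", "</s>"

KGram = Tuple[str, ...]


def kgrams(seq: List[str], k: int) -> Set[KGram]:
    """Return every length-k window of seq, padded at both ends."""
    padded = [START] * (k - 1) + list(seq) + [END]
    return {tuple(padded[i:i + k]) for i in range(len(padded) - k + 1)}


def learn(examples: Iterable[List[str]], k: int) -> Set[KGram]:
    """Collect the k-grams of all example structures."""
    grams: Set[KGram] = set()
    for example in examples:
        grams |= kgrams(example, k)
    return grams


def accepts(grams: Set[KGram], seq: List[str], k: int) -> bool:
    """Accept seq if all of its k-grams occur in the learned set."""
    return kgrams(seq, k) <= grams


# Two example structures of a dictionary entry (element names are made up).
examples = [
    ["Headword", "Sense"],
    ["Headword", "Sense", "Sense", "Sense"],
]
model = learn(examples, k=2)

# Generalization: an entry with two senses was never seen but is accepted.
two_senses_ok = accepts(model, ["Headword", "Sense", "Sense"], k=2)
# An entry that starts with a sense is rejected.
wrong_order_ok = accepts(model, ["Sense", "Headword"], k=2)
```

In this sketch a smaller k generalizes more aggressively, since shorter windows are shared by more sequences; a larger k stays closer to the examples as given.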
Copyright information
© 1994 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ahonen, H., Mannila, H., Nikunen, E. (1994). Forming grammars for structured documents: an application of grammatical inference. In: Carrasco, R.C., Oncina, J. (eds) Grammatical Inference and Applications. ICGI 1994. Lecture Notes in Computer Science, vol 862. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-58473-0_145
DOI: https://doi.org/10.1007/3-540-58473-0_145
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-58473-5
Online ISBN: 978-3-540-48985-6
eBook Packages: Springer Book Archive