Forming grammars for structured documents: an application of grammatical inference

Ahonen, Helena; Mannila, Heikki; Nikunen, Erja

doi:10.1007/3-540-58473-0_145

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 862)

Included in the following conference series:

International Colloquium on Grammatical Inference

Abstract

We consider the problem of generating grammars for classes of structured documents (dictionaries, encyclopedias, user manuals, and so on) from examples. The examples consist of the structures of individual documents, and they can be collected either by converting the typographical tagging of documents prepared for printing into structural tags, or by using document recognition techniques. Our method first forms finite-state automata that describe the examples completely. These automata are then modified by considering certain context conditions; the modifications correspond to generalizing the underlying language. Finally, the automata are converted into regular expressions, which are used to construct the grammar. In addition to automata, an alternative representation, characteristic k-grams, is introduced. Some interactive operations that are necessary for generating a grammar for a large and complicated document are also described.
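
As a rough illustration of how a set of k-grams can generalize from example structures, the sketch below collects every k-gram of the example element sequences and accepts any sequence whose k-grams all occur in that set (a k-testable language in the strict sense). This is a simplified stand-in and not the authors' method: the abstract does not define the context conditions or the characteristic k-grams, and the names `kgrams`, `learn`, `accepts`, the boundary markers, and the dictionary-entry element names are all assumptions made for the example.

```python
from typing import Iterable, Sequence, Set, Tuple

# Hypothetical boundary markers so that k-grams also record how a
# sequence may begin and end.
START, END = "<s>", "</s>"

KGram = Tuple[str, ...]


def kgrams(seq: Sequence[str], k: int) -> Set[KGram]:
    """Return the set of k-grams of seq, padded with boundary markers."""
    if k < 1:
        raise ValueError("k must be at least 1")
    padded = [START] * (k - 1) + list(seq) + [END]
    return {tuple(padded[i:i + k]) for i in range(len(padded) - k + 1)}


def learn(examples: Iterable[Sequence[str]], k: int) -> Set[KGram]:
    """Collect the k-grams seen in all example structures."""
    grams: Set[KGram] = set()
    for example in examples:
        grams |= kgrams(example, k)
    return grams


def accepts(grams: Set[KGram], seq: Sequence[str], k: int) -> bool:
    """Accept seq if every one of its k-grams was seen in the examples."""
    return kgrams(seq, k) <= grams


# Two example dictionary entries (element names are invented).
examples = [
    ["headword", "sense"],
    ["headword", "sense", "sense"],
]
model = learn(examples, k=2)

# An entry with three senses was never seen, but all of its 2-grams were.
print(accepts(model, ["headword", "sense", "sense", "sense"], k=2))  # True
# An entry that starts with a sense has an unseen 2-grams and is rejected.
print(accepts(model, ["sense", "headword"], k=2))  # False
```

With k = 2, the two examples generalize to entries with any positive number of senses, which corresponds to a content model along the lines of `headword, sense+`. A larger k generalizes less, since longer contexts must match before a sequence is accepted.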

Author information

Authors and Affiliations

Department of Computer Science, University of Helsinki, Teollisuuskatu 23, P.O. Box 26, FIN-00014, Finland
Helena Ahonen & Heikki Mannila
Research Centre for Domestic Languages, Sörnäisten rantatie 25, FIN-00500, Helsinki, Finland
Erja Nikunen

Editor information

Rafael C. Carrasco, Jose Oncina

Cite this paper

Ahonen, H., Mannila, H., Nikunen, E. (1994). Forming grammars for structured documents: an application of grammatical inference. In: Carrasco, R.C., Oncina, J. (eds) Grammatical Inference and Applications. ICGI 1994. Lecture Notes in Computer Science, vol 862. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-58473-0_145

DOI: https://doi.org/10.1007/3-540-58473-0_145
Published: 04 June 2005
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-58473-5
Online ISBN: 978-3-540-48985-6
eBook Packages: Springer Book Archive
