Forming grammars for structured documents: an application of grammatical inference

  • Conference paper
Grammatical Inference and Applications (ICGI 1994)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 862)

Abstract

We consider the problem of generating grammars for classes of structured documents (dictionaries, encyclopedias, user manuals, and so on) from examples. The examples consist of the structures of individual documents, and they can be collected either by converting the typographical tagging of documents prepared for printing into structural tags, or by using document recognition techniques. Our method first forms finite-state automata that describe the examples completely. These automata are then modified according to certain context conditions; the modifications correspond to generalizing the underlying language. Finally, the automata are converted into regular expressions, which are used to construct the grammar. In addition to automata, an alternative representation, characteristic k-grams, is introduced. Some interactive operations that are necessary for generating a grammar for a large and complicated document are also described.
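The idea of generalizing from example structures through k-grams can be illustrated with a small sketch. This is not the authors' algorithm: the abstract does not define characteristic k-grams, so the code assumes a simple reading in which a model is the set of all length-k windows seen in the examples, and a new element sequence is accepted when every one of its windows is in that set. The function names, the padding symbols, and the dictionary-entry element names are all hypothetical.

```python
from typing import Iterable, Sequence, Set, Tuple

# Hypothetical padding symbols marking the start and end of a sequence.
START, END = "<start>", "<end>"

KGram = Tuple[str, ...]


def kgrams(seq: Sequence[str], k: int) -> Set[KGram]:
    """Return the set of length-k windows of a padded element sequence."""
    if k < 1:
        raise ValueError("k must be at least 1")
    padded = [START] * (k - 1) + list(seq) + [END]
    return {tuple(padded[i:i + k]) for i in range(len(padded) - k + 1)}


def learn(examples: Iterable[Sequence[str]], k: int) -> Set[KGram]:
    """Collect the k-grams of all example structures (assumed model)."""
    model: Set[KGram] = set()
    for example in examples:
        model |= kgrams(example, k)
    return model


def accepts(model: Set[KGram], seq: Sequence[str], k: int) -> bool:
    """Accept a sequence if all of its k-grams were seen in the examples."""
    return kgrams(seq, k) <= model


# Two example dictionary entries: a headword followed by one or three senses.
examples = [
    ["headword", "sense"],
    ["headword", "sense", "sense", "sense"],
]
model = learn(examples, k=2)

# An entry with two senses was never seen, but its 2-grams all were,
# so the model generalizes to it.
print(accepts(model, ["headword", "sense", "sense"], k=2))  # True
# An entry starting with a sense contains an unseen 2-gram.
print(accepts(model, ["sense", "headword"], k=2))  # False
```

Under this assumed reading, a larger k keeps the model closer to the examples, while a smaller k generalizes more aggressively.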

Editor information

Rafael C. Carrasco, Jose Oncina

Copyright information

© 1994 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Ahonen, H., Mannila, H., Nikunen, E. (1994). Forming grammars for structured documents: an application of grammatical inference. In: Carrasco, R.C., Oncina, J. (eds) Grammatical Inference and Applications. ICGI 1994. Lecture Notes in Computer Science, vol 862. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-58473-0_145

  • DOI: https://doi.org/10.1007/3-540-58473-0_145

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-58473-5

  • Online ISBN: 978-3-540-48985-6

  • eBook Packages: Springer Book Archive
