Abstract
This paper presents a data model for transforming and assembling document information such as SGML or XML documents. Its main advantage over other data models is that it simultaneously provides (1) powerful patterns and contextual conditions, and (2) schema transformation. Patterns capture conditions on subordinates, while contextual conditions capture conditions on superiors, siblings, subordinates of siblings, and so on; both have been recognized in the document processing community as highly important mechanisms for identifying document components. Schema transformation, meanwhile, has been recognized as crucial in the database community since the advent of relational databases. However, no existing data model has provided all three of patterns, contextual conditions, and schema transformation.
This data model is based on the theory of forest-regular languages. A schema is a forest automaton, and an instance is a finite set of forests (sequences of trees). Since the set of parse trees of an extended context-free grammar is accepted by a forest automaton, this model generalizes Gonnet and Tompa’s grammatical model. Patterns are captured as forest automata; contextual conditions are captured as pointed forest representations (a variation of Podelski’s pointed tree representations). Controlled by patterns and contextual conditions, an operator creates an instance from an input instance and also creates a reasonably small schema from an input schema. Furthermore, the created schema is often minimally sufficient: any forest permitted by it may be generated by some input instance.
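To make the idea of a schema as a forest automaton more concrete, the following is a minimal illustrative sketch and not the formal construction used in the paper. It assumes a simplified bottom-up automaton in which each state is a single character, each rule assigns a state to a node from its label and a regular expression over its children's states, and a forest is accepted when the state sequence of its top-level trees matches a final regular expression. The names `Tree`, `ForestAutomaton`, and the toy `doc`/`title`/`para` schema are hypothetical.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Tree:
    """An ordered tree: a label plus a forest (list) of child trees."""
    label: str
    children: list = field(default_factory=list)


class ForestAutomaton:
    """Simplified bottom-up forest automaton (illustrative assumption).

    rules: list of (label, regex over child states, resulting state).
    final: regex over the states of the top-level trees.
    States are single characters so a state sequence is a plain string.
    """

    def __init__(self, rules, final):
        self.rules = [(lab, re.compile(pat), st) for lab, pat, st in rules]
        self.final = re.compile(final)

    def state_of(self, tree):
        # First compute the state sequence of the children, bottom-up.
        child_states = self.run(tree.children)
        if child_states is None:
            return None
        # Use the first rule whose label and child-state pattern both match.
        for lab, pat, st in self.rules:
            if lab == tree.label and pat.fullmatch(child_states):
                return st
        return None

    def run(self, forest):
        # Return the state string for a forest, or None if any tree is stuck.
        states = []
        for tree in forest:
            state = self.state_of(tree)
            if state is None:
                return None
            states.append(state)
        return "".join(states)

    def accepts(self, forest):
        word = self.run(forest)
        return word is not None and self.final.fullmatch(word) is not None


# Toy schema: a forest is one or more docs; a doc is a title followed by
# one or more paras; titles and paras have no element children.
schema = ForestAutomaton(
    rules=[("title", "", "t"), ("para", "", "p"), ("doc", "tp+", "d")],
    final="d+",
)

ok = schema.accepts([Tree("doc", [Tree("title"), Tree("para"), Tree("para")])])
bad = schema.accepts([Tree("doc", [Tree("para")])])  # missing title
```

In this sketch the rule for `doc` plays the role of a content model in an extended context-free grammar, which is why parse trees of such grammars fit this kind of automaton. A pattern could be written the same way, as a second automaton that is run over subordinates of a candidate node.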
References
Abiteboul, S., Cluet, S., Milo, T.: Querying and updating the file. VLDB’ 93 19 (1993) 73–84
Abiteboul, S., Hull, R., Vianu, V.: Foundations of Databases. Addison-Wesley (1995)
Baeza-Yates, R., Navarro, G.: Integrating contents and structure in text retrieval. SIGMOD Record 25:1 (1996) 67–79
Blake, G., Bray, T., Tompa, F.: Shortening the OED: Experience with a grammar-defined database. ACM TOIS 10:3 (1992) 213–232
Christophides, V., Abiteboul, S., Cluet, S., Scholl, M.: From structured documents to novel query facilities. SIGMOD Record 23:2 (1994) 313–324
Colby, L., Van Gucht, D., Saxton, L.: Concepts for modeling and querying list-structured data. Information Processing & Management 30:5 (1994) 687–709
Gécseg, F., Steinby, M.: Tree Automata. Akadémiai Kiadó (1984)
Gonnet, G., Tompa, F.: Mind your grammar: a new approach to modeling text. VLDB’ 87 13 (1987) 339–346
Gyssens, M., Paredaens, J., Van Gucht, D.: A grammar-based approach towards unifying hierarchical data models. SIAM Journal on Computing 23:6 (1994) 1093–1137
Murata, M.: Transformation of documents and schemas by patterns and contextual conditions. Lecture Notes in Computer Science 1293 (1997) 153–169
Pair, C., Quere, A.: Définition et étude des bilangages réguliers. Information and Control 13:6 (1968) 565–593
Podelski, A.: A monoid approach to tree automata. In: Tree Automata and Languages. North-Holland (1992) 41–56
Takahashi, M.: Generalizations of regular sets and their application to a study of context-free languages. Information and Control 27 (1975) 1–36
Zdonik, S., Maier, D.: Readings in Object-Oriented Database Systems. Morgan Kaufmann (1990)
© 1998 Springer-Verlag Berlin Heidelberg
Cite this paper
Murata, M. (1998). Data Model for Document Transformation and Assembly. In: Munson, E.V., Nicholas, C., Wood, D. (eds) Principles of Digital Document Processing. PODDP 1998. Lecture Notes in Computer Science, vol 1481. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-49654-8_12
DOI: https://doi.org/10.1007/3-540-49654-8_12
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-65086-7
Online ISBN: 978-3-540-49654-0