Abstract
We study the problem of rediscovering the schema of nested relations that have been encoded as strings for storage purposes. We consider various classes of encoding functions and focus on mark-up encodings, which allow the schema to be found without knowledge of the encoding function, under reasonable assumptions on the input data. Depending on how empty sets are encoded, we propose two polynomial on-line algorithms (with different buffer sizes) that solve the schema finding problem. We also prove that, with high probability, both algorithms find the schema after examining a fixed number of tuples, which in practice leads to linear-time behavior with respect to the database size for wrapping the data. Finally, we show that the proposed techniques are well suited to practical applications, such as structuring and wrapping HTML pages and Web sites.
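To make the setting concrete, the following is a minimal sketch of a nested relation written out as a marked-up string, together with a reader that recovers its nesting structure. It is a toy illustration only: the tag names `<t>`, `<s>`, `<a>`, the functions `encode` and `structure`, and the sample data are all assumptions made here, not the encoding functions or algorithms of the paper. Unlike the setting studied in the paper, this reader is told the tags in advance, and it assumes atomic values never contain `<`.

```python
import re

# Hypothetical tags: <t> tuple, <s> set, <a> atomic value.
TOKEN = re.compile(r"<(/?)([tsa])>|([^<]+)")


def encode(value):
    """Encode a nested value as a string with explicit open/close marks."""
    if isinstance(value, tuple):      # tuple of attributes
        return "<t>" + "".join(encode(v) for v in value) + "</t>"
    if isinstance(value, list):       # set, kept as a list for a stable order
        return "<s>" + "".join(encode(v) for v in value) + "</s>"
    return "<a>" + str(value) + "</a>"  # atomic value


def structure(encoded):
    """Recover the nesting structure from the string, given the tags."""
    stack = [[]]
    for close, tag, text in TOKEN.findall(encoded):
        if text:                      # atomic content carries no structure
            continue
        if not close:                 # opening mark: start a new child list
            stack.append([])
            continue
        children = stack.pop()        # closing mark: build the node
        if tag == "a":
            node = "a"
        elif tag == "t":
            node = "(" + ", ".join(children) + ")"
        else:
            # An empty set says nothing about the type of its elements.
            node = "{" + (children[0] if children else "?") + "}"
        stack[-1].append(node)
    return stack[0][0]


# Made-up sample data: people, each with a set of (title, year) pairs.
people = [
    ("alice", [("notes", 1994)]),
    ("bob", []),
]
encoded = encode(people)
print(encoded)
print(structure(encoded))
```

In this sketch the set type is read off its first element only, and the second tuple's empty set yields `{?}` when read in isolation, which gives one intuition for why the treatment of empty sets matters when inferring a schema from encoded data.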
Work supported by Università di Roma Tre, MURST and Consiglio Nazionale delle Ricerche.
© 1999 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Grumbach, S., Mecca, G. (1999). In Search of the Lost Schema. In: Beeri, C., Buneman, P. (eds) Database Theory — ICDT’99. ICDT 1999. Lecture Notes in Computer Science, vol 1540. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-49257-7_20
DOI: https://doi.org/10.1007/3-540-49257-7_20
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-65452-0
Online ISBN: 978-3-540-49257-3
eBook Packages: Springer Book Archive