Abstract
Heterogeneous information sources are organized to varying degrees, ranging from well-structured data to semi-structured and unstructured data. Such information sources do not have a rigid schema available in advance, or, even if each source has its own schema, no modeling constraints or data formats are enforced across information sources. In this paper, we propose a novel method for abstracting schemas for heterogeneous information sources. At the most detailed level, information sources are represented as a labeled directed graph. We develop several abstraction operations for label generalization and aggregation. One or more of these operations can be applied to a labeled directed graph to “levelize” schemas. Each such level of the schema is a potentially useful paradigm for query formulation and optimization.
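As a rough illustration of the idea, and not the operations defined in the paper, the following Python sketch represents a source as a set of labeled edges and applies two simplified abstraction steps: label generalization through a hypothetical label taxonomy, and aggregation through a hypothetical node grouping. The function names, the taxonomy, the grouping, and the sample data are all assumptions made for this example.

```python
# A labeled directed graph as a set of (source, label, target) edges.
Edge = tuple[str, str, str]


def generalize_labels(edges: set[Edge], taxonomy: dict[str, str]) -> set[Edge]:
    """Replace each edge label by its parent in an assumed label taxonomy.

    Labels that do not appear in the taxonomy are left unchanged.
    """
    return {(src, taxonomy.get(label, label), dst) for src, label, dst in edges}


def aggregate_nodes(edges: set[Edge], group_of: dict[str, str]) -> set[Edge]:
    """Collapse nodes into assumed groups; duplicate edges merge automatically.

    Nodes that do not appear in the grouping keep their own identity.
    """
    return {
        (group_of.get(src, src), label, group_of.get(dst, dst))
        for src, label, dst in edges
    }


# Most detailed level: two people with different roles (made-up data).
level0 = {
    ("root", "professor", "p1"),
    ("root", "lecturer", "p2"),
    ("p1", "name", "n1"),
    ("p2", "name", "n2"),
}

# Hypothetical taxonomy: both roles generalize to "faculty".
taxonomy = {"professor": "faculty", "lecturer": "faculty"}
level1 = generalize_labels(level0, taxonomy)

# Hypothetical grouping: person nodes and name nodes are each aggregated.
group_of = {"p1": "P", "p2": "P", "n1": "N", "n2": "N"}
level2 = aggregate_nodes(level1, group_of)

print(sorted(level2))
```

Each successive result (`level0`, `level1`, `level2`) is a coarser view of the same data, which is one way to picture what a "level" of the schema might look like; the paper's own operations may differ in both definition and detail.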
Copyright information
© 2000 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Yoon, J.P., Raghavan, V. (2000). Multi-level Schema Extraction for Heterogeneous Semi-structured Data. In: Lu, H., Zhou, A. (eds) Web-Age Information Management. WAIM 2000. Lecture Notes in Computer Science, vol 1846. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45151-X_39
DOI: https://doi.org/10.1007/3-540-45151-X_39
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-67627-0
Online ISBN: 978-3-540-45151-8
eBook Packages: Springer Book Archive