Multi-level Schema Extraction for Heterogeneous Semi-structured Data

Yoon, Jong P.; Raghavan, Vijay

doi:10.1007/3-540-45151-X_39

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1846)

Included in the following conference series:

International Conference on Web-Age Information Management

Abstract

Heterogeneous information sources are organized to varying degrees, ranging from well-structured data to semi-structured and unstructured data. Such sources do not have a rigid schema available in advance; even when each source has its own schema, no modeling constraints or formats are enforced across sources. In this paper, we propose a novel method for abstracting schemas for heterogeneous information sources. At the most detailed level, information sources are represented as a labeled directed graph. We develop several abstraction operations for label generalization and aggregation. One or more of these operations can be applied to a labeled directed graph to “levelize” schemas. Each such level of the schemas is a potentially useful paradigm for query formation and optimization.
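
The abstract does not spell out the operations themselves, so the following is only a minimal sketch of what label generalization and aggregation on a labeled directed graph might look like. The function names, the label hierarchy, and the sample data are all hypothetical illustrations and are not taken from the paper; aggregation is modeled here as plain node contraction, which may differ from the authors' definition.

```python
from typing import Dict, Set, Tuple

# An edge-labeled directed graph as a set of (source, label, target) triples.
Edge = Tuple[str, str, str]


def generalize_labels(edges: Set[Edge], hierarchy: Dict[str, str]) -> Set[Edge]:
    """Replace each edge label by its parent in an assumed label hierarchy.

    Labels without an entry in the hierarchy are left unchanged.
    """
    return {(src, hierarchy.get(label, label), dst) for src, label, dst in edges}


def aggregate_nodes(edges: Set[Edge], group: Set[str], merged: str) -> Set[Edge]:
    """Collapse every node in `group` into the single node `merged`.

    Edges that would start and end inside the group are dropped.
    """
    result: Set[Edge] = set()
    for src, label, dst in edges:
        s = merged if src in group else src
        d = merged if dst in group else dst
        if s == merged and d == merged:
            continue  # edge internal to the aggregated group
        result.add((s, label, d))
    return result


# Hypothetical most-detailed level: two sources using different vocabularies.
level0: Set[Edge] = {
    ("db", "paper", "p1"),
    ("db", "article", "p2"),
    ("p1", "author", "a1"),
    ("p2", "writer", "a2"),
}

# Assumed hierarchy mapping specific labels to more general ones.
hierarchy = {
    "paper": "publication",
    "article": "publication",
    "author": "creator",
    "writer": "creator",
}

# Each application of an operation yields a coarser schema level.
level1 = generalize_labels(level0, hierarchy)
level2 = aggregate_nodes(level1, {"p1", "p2"}, "P")
level3 = aggregate_nodes(level2, {"a1", "a2"}, "A")
```

In this sketch each successive level is smaller than the one before, which is the sense in which a coarser level could serve as a compact guide when forming queries against the underlying sources.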

Author information

Authors and Affiliations

Center for Advanced Computer Studies, University of Louisiana, Lafayette, LA, 70504-4330
Jong P. Yoon & Vijay Raghavan

Editor information

Editors and Affiliations

Department of Computer Science, Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong, China
Hongjun Lu
Department of Computer Science, Fudan University, 220 Handan Road, Shanghai, China
Aoying Zhou

Cite this paper

Yoon, J.P., Raghavan, V. (2000). Multi-level Schema Extraction for Heterogeneous Semi-structured Data. In: Lu, H., Zhou, A. (eds) Web-Age Information Management. WAIM 2000. Lecture Notes in Computer Science, vol 1846. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45151-X_39

DOI: https://doi.org/10.1007/3-540-45151-X_39
Published: 07 November 2002
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-67627-0
Online ISBN: 978-3-540-45151-8
eBook Packages: Springer Book Archive
