Multi-level Schema Extraction for Heterogeneous Semi-structured Data

  • Conference paper
Web-Age Information Management (WAIM 2000)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1846)

Abstract

Heterogeneous information sources vary in their degree of organization, from well-structured data to semi-structured and unstructured data. Such sources often have no rigid schema available in advance, and even when each source has its own schema, no modeling constraints or formats are enforced across sources. In this paper, we propose a novel method for abstracting schemas for heterogeneous information sources. At the most detailed level, information sources are represented as a labeled directed graph. We develop several abstraction operations for label generalization and aggregation. One or more of these operations can be applied to a labeled directed graph to “levelize” schemas. Each resulting schema level is a potentially useful paradigm for query formation and optimization.
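
To make the idea of levels concrete, the following is a minimal sketch of one possible label generalization step on an edge-labeled directed graph. It is not the authors' algorithm: the sample edges, the `generalize` mapping, and the helper names `generalize_labels` and `label_paths` are all hypothetical, chosen only to show how a coarser label vocabulary yields a smaller set of label paths.

```python
from collections import defaultdict

# Hypothetical edge-labeled graph: (source node, label, target node).
edges = [
    ("root", "book", "b1"),
    ("root", "article", "a1"),
    ("b1", "author", "p1"),
    ("a1", "writer", "p2"),
    ("b1", "title", "t1"),
    ("a1", "title", "t2"),
]

# Assumed generalization hierarchy (illustrative, not from the paper):
# each specific label maps to a broader one.
generalize = {"book": "publication", "article": "publication", "writer": "author"}


def generalize_labels(edges, mapping):
    """Replace each edge label with its broader label, if one is defined."""
    return [(src, mapping.get(label, label), dst) for src, label, dst in edges]


def label_paths(edges, root, max_depth=3):
    """Collect the distinct label paths reachable from root (a simple schema summary)."""
    outgoing = defaultdict(list)
    for src, label, dst in edges:
        outgoing[src].append((label, dst))
    paths = set()
    stack = [(root, ())]
    while stack:
        node, path = stack.pop()
        if len(path) >= max_depth:  # depth bound also guards against cycles
            continue
        for label, dst in outgoing[node]:
            extended = path + (label,)
            paths.add(extended)
            stack.append((dst, extended))
    return paths


detailed = label_paths(edges, "root")
coarse = label_paths(generalize_labels(edges, generalize), "root")
```

In this toy input, the detailed level distinguishes `book` from `article` and `author` from `writer`, while the generalized level merges them into `publication` and `author`, so a query could be formulated against whichever level suits it.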

Copyright information

© 2000 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Yoon, J.P., Raghavan, V. (2000). Multi-level Schema Extraction for Heterogeneous Semi-structured Data. In: Lu, H., Zhou, A. (eds) Web-Age Information Management. WAIM 2000. Lecture Notes in Computer Science, vol 1846. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45151-X_39

  • DOI: https://doi.org/10.1007/3-540-45151-X_39

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-67627-0

  • Online ISBN: 978-3-540-45151-8

  • eBook Packages: Springer Book Archive
