Schema Extraction and Integration of Heterogeneous XML Document Collections

Janga, Prudhvi; Davis, Karen C.

doi:10.1007/978-3-642-41366-7_15

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 8216)

Included in the following conference series:

International Conference on Model and Data Engineering

Abstract

The availability of vast amounts of heterogeneous XML web data motivates finding efficient methods to search, integrate, query, and present this data. The structure of XML documents is useful for achieving these tasks; however, not every XML document on the web includes a schema. We discuss challenges and accomplishments in the area of generation and integration of XML schemas. We propose and implement a framework for efficient schema extraction and integration from heterogeneous XML document collections collected from the web. Our approach introduces the Schema Extended Context Free Grammar (SECFG) to model XML schemas, including detection of attributes, data types, and element occurrences. Unlike other implementations, our approach supports the generation of XML schemas in any XML schema language, e.g., DTDs or XSD. We compare our approach with other proposed approaches and conclude that we offer the same or better functionality more efficiently and with greater flexibility.
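
As a rough illustration of the kind of information the abstract mentions (attributes, data types, and element occurrences), the following minimal Python sketch summarizes a small collection of XML documents. It is not the SECFG framework described in the chapter. The function names (`guess_type`, `occurrence`, `summarize`), the three-way type classification, and the DTD-style occurrence indicators are assumptions made only for this example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def guess_type(text):
    """Classify a text value as 'integer', 'decimal', or 'string' (a crude heuristic)."""
    try:
        int(text)
        return "integer"
    except ValueError:
        pass
    try:
        float(text)
        return "decimal"
    except ValueError:
        return "string"


def occurrence(counts, parent_instances):
    """Map the per-parent counts of a child element to a DTD-style indicator."""
    optional = len(counts) < parent_instances  # child missing from some parent instance
    repeated = max(counts) > 1                 # child repeated within some parent instance
    if optional and repeated:
        return "*"
    if repeated:
        return "+"
    if optional:
        return "?"
    return ""  # exactly once in every observed parent instance


def summarize(documents):
    """Collect attributes, leaf text types, and child occurrences per element name."""
    instances = defaultdict(int)                           # element -> instances seen
    child_counts = defaultdict(lambda: defaultdict(list))  # parent -> child -> counts
    attributes = defaultdict(set)                          # element -> attribute names
    types = defaultdict(set)                               # element -> guessed text types

    for doc in documents:
        for elem in ET.fromstring(doc).iter():
            instances[elem.tag] += 1
            attributes[elem.tag].update(elem.attrib)
            text = (elem.text or "").strip()
            if text and len(elem) == 0:  # only type the text of leaf elements
                types[elem.tag].add(guess_type(text))
            per_child = defaultdict(int)
            for child in elem:
                per_child[child.tag] += 1
            for tag, n in per_child.items():
                child_counts[elem.tag][tag].append(n)

    return {
        tag: {
            "attributes": sorted(attributes[tag]),
            "children": {
                child: occurrence(counts, instances[tag])
                for child, counts in child_counts[tag].items()
            },
            "types": sorted(types[tag]),
        }
        for tag in instances
    }
```

Given two hypothetical `book` documents, one with an `id` attribute, two `author` children, and a `year`, and one with a single `author` and no `year`, `summarize` would report `author` as `+`, `year` as `?`, and `title` as occurring exactly once. Turning such a summary into a DTD or XSD is a separate step that this sketch does not attempt.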

Author information

Authors and Affiliations

University of Cincinnati, Cincinnati, Ohio, USA
Prudhvi Janga & Karen C. Davis

Editor information

Editors and Affiliations

ICAR-CNR and University of Calabria, 87036, Cosenza, Italy
Alfredo Cuzzocrea
INRIA-Bordeaux Sud Ouest, France
Sofian Maabout

About this paper

Cite this paper

Janga, P., Davis, K.C. (2013). Schema Extraction and Integration of Heterogeneous XML Document Collections. In: Cuzzocrea, A., Maabout, S. (eds) Model and Data Engineering. MEDI 2013. Lecture Notes in Computer Science, vol 8216. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-41366-7_15

DOI: https://doi.org/10.1007/978-3-642-41366-7_15
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-41365-0
Online ISBN: 978-3-642-41366-7
eBook Packages: Computer Science, Computer Science (R0)