XML subtree reconstruction from relational storage of XML documents

https://doi.org/10.1016/j.datak.2006.08.002

Abstract

Numerous researchers have proposed using relational databases to store and query XML documents. In these systems, the elements selected by an XML query are returned to an application in either select mode or reconstruct mode. In reconstruct mode, the XML subtrees rooted at the selected elements need to be extracted and reconstructed from the relational storage of XML documents. XML subtree reconstruction is therefore an important problem, since its efficiency has a significant impact on XML query response time. In this paper, we propose (i) a linear XML subtree reconstruction algorithm, Reconstruct, which reconstructs an XML subtree from the structure-encoded sequence of the subtree, where the sequence is extracted from the relational database by a structure-encoded sequence retrieval algorithm; (ii) a generic, efficient structure-encoded sequence retrieval algorithm, RD-SB, for schema-based relational XML storage; and (iii) a generic, efficient structure-encoded sequence retrieval algorithm, RD-SL, for schema-less relational XML storage. To the best of our knowledge, our algorithms provide the first generic solutions to the XML subtree reconstruction problem that are applicable to all relational XML storage schemes proposed in the literature. Finally, our experiments show that our algorithms are efficient and scalable.

Introduction

Recently, there has been extensive work on the development of XML storage and management systems. Such systems fall into two broad categories: relational-based [11], [15], [29], [34], [23], [26], [5], [25] and native [28], [19], [16], [36], [37], [2] XML databases. The former use relational databases and XML-to-Relational data model mappings to store and query XML documents. The latter define a logical model for an XML document (as opposed to the data in that document) and store and retrieve documents according to that model, usually without requiring any particular underlying physical storage model.

In these systems, the elements selected by an XML query are returned to an application in one of the following two modes [38]: (1) Select mode: the unique identifiers (IDs) of the selected elements are returned to the application, which can extract the contents of these elements later if necessary; or (2) Reconstruct mode: the XML subtrees rooted at the selected elements are extracted and reconstructed from the storage of XML documents and returned to the application. XML subtree reconstruction is therefore an important component of such systems, and its performance has a great impact on query response time.

While previous work on XML publishing [32], [31], [12], [13], [14], [6] focuses on publishing existing relational data as XML documents, we are interested in reconstructing XML subtrees from relational storage of XML documents. This requires using the XML-to-Relational schema mapping information to derive the Relational-to-XML mapping information and to reconstruct XML subtrees that conform to the structures of the original XML documents. Therefore, these XML publishing methods are not directly applicable to our scenario. For example, Shanmugasundaram et al. [32] describe several methods for computing XML views with relational engines, but none of them uses the XML-to-Relational mapping information or enforces the structure conformance constraint between a reconstructed XML subtree and the original XML document. Similarly, Fernandez et al. [13] take an XML view approach in their middle-ware system SilkRoute but focus on decomposing an XML view query into an optimal set of SQL queries. For the same reason, this method is not applicable to our problem either.

Although Krishnamurthy et al. [22] briefly discuss XML subtree reconstruction in the context of recursive XML schemas, no algorithmic details are provided. Moreover, the approach is not generic because it does not support the two classic schema-based storage schemes, Basic and Hybrid [34], in which XML elements of the same type might be stored in multiple relations. Other researchers [38], [39], [15] only report some experimental results on XML subtree reconstruction without presenting an algorithm. Although Shanmugasundaram et al. [33] did present an algorithm for generating a reconstruction XML view, the reconstruction relies on an XML query engine that is already in place, which in general is not available. Therefore, this method is not generic, since it is not applicable across all relational XML storage schemes.

The main contributions of this paper are:

  • (1) We propose a linear XML subtree reconstruction algorithm Reconstruct to reconstruct an XML subtree from the structure-encoded sequence of the subtree that is extracted from the relational database by a structure-encoded sequence retrieval algorithm.

  • (2) We propose a generic efficient structure-encoded sequence retrieval algorithm RD-SB for a schema-based relational XML storage.

  • (3) We propose a generic efficient structure-encoded sequence retrieval algorithm RD-SL for a schema-less relational XML storage.

  • (4) Finally, our experiments show that our algorithms are efficient and scalable.

This work is an extension of our previously proposed algorithm [8], which is only applicable to the Shared approach [34]. To the best of our knowledge, the algorithms presented in this paper provide the first generic solutions to the XML subtree reconstruction problem that are applicable to all relational XML storage schemes proposed in the literature.

Organization. The rest of the paper is organized as follows. Section 2 summarizes related work. Section 3 defines the notion of a structure-encoded sequence and introduces our XML subtree reconstruction algorithm Reconstruct. Section 4 formalizes the notion of a schema-based relational XML storage scheme and proposes algorithm RD-SB for a schema-based relational XML storage. Section 5 formalizes the notion of a schema-less relational XML storage scheme and proposes algorithm RD-SL for a schema-less relational XML storage. Finally, we present our conclusions and future work in Section 6.

Related work

Numerous researchers have proposed using relational databases to store and query XML documents [11], [15], [29], [34], [23], [26], [5], [25]. The main challenge of this approach is resolving the conflict between the hierarchical, ordered nature of the XML data model and the flat, unordered nature of the relational data model. This conflict can be resolved by the following three XML-to-Relational mappings:

  • Schema generation/mapping. Either a fixed generic relational schema is

XML subtree reconstruction

In this section, we propose a generic and linear XML subtree reconstruction algorithm Reconstruct, discuss its support of various numbering schemes, and present our experimental study of the algorithm.
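As a rough illustration of why an XML subtree can be rebuilt from a document-ordered sequence in a single linear pass, the following sketch uses a stack of currently open elements. It assumes, purely for illustration, that each sequence entry is a (level, tag, text) triple listed in document order. The definition of a structure-encoded sequence and the Reconstruct algorithm itself are not included in this excerpt, so the function name `reconstruct_sketch` and the input format are hypothetical and should not be read as the authors' method.

```python
import xml.etree.ElementTree as ET

def reconstruct_sketch(sequence):
    """Rebuild an XML subtree from (level, tag, text) triples in document order.

    Hypothetical input format, not the paper's structure-encoded sequence.
    Each element is pushed once and popped at most once, so the pass is linear.
    """
    root = None
    stack = []  # stack[i] is the currently open element at depth i
    for level, tag, text in sequence:
        # Close every open element that is not an ancestor of the new one.
        del stack[level:]
        elem = ET.Element(tag)
        elem.text = text
        if stack:
            stack[-1].append(elem)  # attach to the nearest open ancestor
        else:
            root = elem             # first entry is the subtree root
        stack.append(elem)
    return root

# Illustrative input: a small subtree rooted at a <book> element.
seq = [
    (0, "book", None),
    (1, "title", "XML"),
    (1, "authors", None),
    (2, "author", "A"),
    (2, "author", "B"),
]
print(ET.tostring(reconstruct_sketch(seq), encoding="unicode"))
```

The stack holds only the path from the subtree root to the most recent element, so memory use is bounded by the depth of the subtree rather than its size.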

Structure-encoded sequence retrieval from a schema-based relational XML storage

In the following, we define XML-to-Relational mappings to capture a schema-based relational XML storage scheme, propose our generic structure-encoded sequence retrieval algorithm RD-SB for such an XML storage, and report the results of our experimental study.

Structure-encoded sequence retrieval from a schema-less relational XML storage

In the following, we define XML-to-Relational mappings to capture a schema-less relational XML storage scheme, propose our generic structure-encoded sequence retrieval algorithm RD-SL for such an XML storage, and report the results of our experimental study.
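To make the idea of retrieving a subtree from a schema-less storage more concrete, here is a minimal sketch that uses a single edge-style table in SQLite. The table `edge`, its columns, and the function `retrieve_subtree_rows` are assumptions made for illustration; the query is a generic recursive traversal, not the RD-SL algorithm, whose details are not part of this excerpt. Node IDs are assumed to have been assigned in document order when the document was stored.

```python
import sqlite3

def retrieve_subtree_rows(conn, root_id):
    """Return (id, parent_id, tag, value) rows for the subtree rooted at root_id.

    Rows are sorted by id, which this sketch assumes matches document order.
    """
    query = """
        WITH RECURSIVE subtree(id, parent_id, tag, value) AS (
            SELECT id, parent_id, tag, value FROM edge WHERE id = ?
            UNION ALL
            SELECT e.id, e.parent_id, e.tag, e.value
            FROM edge AS e JOIN subtree AS s ON e.parent_id = s.id
        )
        SELECT id, parent_id, tag, value FROM subtree ORDER BY id
    """
    return conn.execute(query, (root_id,)).fetchall()

# Hypothetical schema-less storage: one generic table for all elements.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE edge (id INTEGER PRIMARY KEY, parent_id INTEGER, tag TEXT, value TEXT)"
)
conn.executemany(
    "INSERT INTO edge VALUES (?, ?, ?, ?)",
    [
        (1, None, "library", None),
        (2, 1, "book", None),
        (3, 2, "title", "XML"),
        (4, 2, "author", "A"),
        (5, 1, "book", None),
        (6, 5, "title", "SQL"),
    ],
)
rows = retrieve_subtree_rows(conn, 2)
print(rows)
```

The rows returned for a selected element carry enough structural information (here, parent IDs in document order) for a separate reconstruction step to rebuild the subtree.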

Conclusions and future work

Recent interest in storing and querying XML documents using RDBMSs motivated us to design algorithms for reconstructing XML subtrees from a relational database. In our proposed solution, a reconstruction algorithm is composed of algorithm Reconstruct, which reconstructs an XML subtree rooted at an arbitrary element from the structure-encoded sequence of the subtree, and a structure-encoded sequence retrieval algorithm, which retrieves the sequence from a relational database.

Reconstruct is a

Artem Chebotko is currently a Ph.D. student in the Department of Computer Science at Wayne State University. His research interests include XML databases, Semantic Web and database security. He has published several refereed international journal and conference papers, including articles that appeared in Data & Knowledge Engineering, Information Systems and the International Journal on Semantic Web and Information Systems. He is a member of ACM and IEEE.

References (43)

  • S. Sipani et al., Designing a high-performance database engine for the ‘Db4XML’ native XML database system, Journal of Systems and Software (2004)
  • S. Al-Khalifa, H.V. Jagadish, J.M. Patel, Y. Wu, N. Koudas, D. Srivastava, Structural joins: a primitive for efficient...
  • Apache Software Foundation, Apache Xindice. Available from:...
  • M. Atay, A. Chebotko, D. Liu, S. Lu, F. Fotouhi, Efficient schema-based XML-to-relational data mapping, Information...
  • M. Atay, D. Liu, Y. Sun, S. Lu, F. Fotouhi, Mapping XML data to relational data: A DOM-based approach, in: Proc. of the...
  • P. Bohannon, J. Freire, P. Roy, J. Simeon, From XML schema to relations: A cost-based approach to XML storage, in:...
  • P. Bohannon, S. Ganguly, H. Korth, P.P.S. Narayan, P. Shenoy, Optimizing view queries in ROLEX to support navigable...
  • N. Bruno, N. Koudas, D. Srivastava, Holistic twig joins: optimal XML pattern matching, in: Proc. of the ACM SIGMOD...
  • A. Chebotko, D. Liu, M. Atay, S. Lu, F. Fotouhi, Reconstructing XML subtrees from relational storage of XML documents,...
  • S.-Y. Chien, Z. Vagena, D. Zhang, V.J. Tsotras, C. Zaniolo, Efficient structural joins on indexed XML documents, in:...
  • D. DeHaan, D. Toman, M.P. Consens, M.T. Ozsu, A comprehensive XQuery to SQL translation using dynamic interval...
  • A. Deutsch, M.F. Fernandez, D. Suciu, Storing semistructured data with STORED, in: Proc. of the ACM SIGMOD Conference,...
  • M.F. Fernandez et al., SilkRoute: A framework for publishing relational data in XML, ACM Transactions on Database Systems (2002)
  • M.F. Fernandez, A. Morishima, D. Suciu, Efficient evaluation of XML middle-ware queries, in: Proc. of the ACM SIGMOD...
  • M.F. Fernandez, A. Morishima, W.C. Tan, SilkRoute: Trading between relations and XML, in: Proc. of the World Wide Web...
  • D. Florescu et al., Storing and querying XML data using an RDBMS, IEEE Data Engineering Bulletin (1999)
  • A. Fomichev, M. Grinev, S. Kuznetsov, Descriptive schema driven XML storage, Technical Report, Institute for System...
  • T. Grust, Accelerating XPath location steps, in: Proc. of the ACM SIGMOD Conference, 2002, pp....
  • T. Grust, M. van Keulen, J. Teubner, Staircase join: teach a relational DBMS to watch its axis steps, in: Proc. of the...
  • Institute for System Programming of Russian Academy of Sciences, Sedna Native XML DBMS. Available from:...
  • S. Jain, R. Mahajan, D. Suciu, Translating XSLT programs to efficient SQL queries, in: Proc. of the World Wide Web...
    Mustafa Atay received his Ph.D. degree in computer science from Wayne State University in 2006. He joined the faculty of Wayne State University in the same year as a lecturer of the Department of Computer Science. His research interests include XML data management and performance analysis of query processing in database systems. He has published several refereed international journal and conference papers on the above areas. He is a member of ACM.

    Shiyong Lu received his Ph.D. degree in computer science from Stony Brook University in 2002. He is currently an assistant professor in the Department of Computer Science of Wayne State University. His research interests include XML databases, Semantic Web, workflows, database security, data mining, and bioinformatics. He has published over 60 refereed international journal and conference papers in the above areas. He is an editorial board member of the International Journal on Semantic Web and Information Systems and of the International Journal of Medical Information System and Informatics. He is a program committee member of several IEEE and ACM conferences. He is a member of both ACM and IEEE.

    Farshad Fotouhi received his Ph.D. in computer science from Michigan State University in 1988. He joined the faculty of Computer Science at Wayne State University in August 1988 where he is currently Professor and Chair of the department. Dr. Farshad Fotouhi’s major areas of research include XML databases, relational and object-oriented databases, Semantic Web, multimedia systems, and query optimization. He has published over 100 papers in refereed journals and conference proceedings and has served as a program committee member of various database related conferences. Dr. Farshad Fotouhi is on the Editorial Boards of the IEEE Multimedia Magazine and the International Journal on Semantic Web and Information Systems.
