Abstract
XML documents contain substantial redundancy in their structural part: each path from the root node to a leaf node is represented explicitly, and typically large sets of such path instances belong to the same path class, i.e., the nodes of the path instances are labeled by the same sequence of element (or attribute) names. To save storage space and I/O cost, we want to eliminate this structural redundancy to the extent possible. Whereas all known methods for the physical representation (storage) of XML documents proceed from the root via the element/attribute hierarchy (internal nodes) down to the leaves (values), we follow an upside-down approach that explicitly stores the values and reconstructs the internal nodes only if needed. The cornerstones of such a solution are suitable node labels and a path synopsis that efficiently represents all path classes of an XML document. We propose a compact internal storage format for native XML database systems in which the inner structure of the stored documents is virtualized. Because this elementless storage format allows a document to be reconstructed efficiently from its path synopsis, all processing properties are preserved and the semantics of navigational and declarative operations of XML languages remains unchanged. Adjusted indexes support the full spectrum of so-called content-and-structure single-path queries. Apart from greatly reduced storage consumption, our approach proves superior to competing methods not only for a substantial fraction of those queries, but also for storing, reconstructing, and navigating XML documents.
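The idea of storing only values and reconstructing internal nodes from a path synopsis can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not specify the label scheme or the synopsis layout, so the sketch assumes prefix-based (Dewey-style) labels, a synopsis that maps each root-to-leaf name sequence to a numeric path class id, and a toy document. The names `DOC`, `shred`, and `ancestors` are hypothetical.

```python
# Toy document as nested (name, children-or-text) pairs (illustrative data only).
DOC = ("bib", [
    ("book", [("title", "TCP/IP"), ("author", "Stevens")]),
    ("book", [("title", "Data on the Web"), ("author", "Abiteboul")]),
])


def shred(node, label=(1,), path=(), synopsis=None, leaves=None):
    """Build the path synopsis and the value-only ("elementless") leaf records.

    Assumption: a node's label is its parent's label plus its sibling position,
    so every ancestor label is a prefix of the leaf label.
    """
    synopsis = {} if synopsis is None else synopsis
    leaves = [] if leaves is None else leaves
    name, content = node
    path = path + (name,)
    if isinstance(content, str):
        # Leaf: register its path class once, then store (label, class id, value).
        pcr = synopsis.setdefault(path, len(synopsis) + 1)
        leaves.append((label, pcr, content))
    else:
        # Internal node: nothing is stored, only the children are visited.
        for position, child in enumerate(content, start=1):
            shred(child, label + (position,), path, synopsis, leaves)
    return synopsis, leaves


def ancestors(label, pcr, synopsis):
    """Reconstruct the (label, name) pairs on the path from the root to a leaf."""
    path = next(p for p, class_id in synopsis.items() if class_id == pcr)
    return [(label[:depth], path[depth - 1]) for depth in range(1, len(label) + 1)]


if __name__ == "__main__":
    synopsis, leaves = shred(DOC)
    for label, pcr, value in leaves:
        print(label, pcr, value, ancestors(label, pcr, synopsis))
```

In this sketch, the four leaves share two path classes, so the element names `bib`, `book`, `title`, and `author` are kept once in the synopsis rather than once per path instance; each internal node is derived on demand from a leaf label and its path class.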
Additional information
Financial support by the Research Center (CM)² of the University of Kaiserslautern is acknowledged (http://cmcm.uni-kl.de).
CR subject classification
E.2, H.2.2, H.2.4
About this article
Cite this article
Mathis, C., Härder, T. & Schmidt, K. Storing and indexing XML documents upside down. Comp. Sci. Res. Dev. 24, 51–68 (2009). https://doi.org/10.1007/s00450-009-0056-x
DOI: https://doi.org/10.1007/s00450-009-0056-x