Storing and indexing XML documents upside down

  • Special Issue Paper
Computer Science - Research and Development

Abstract

XML documents contain substantial redundancy in their structure part, because each path from the root node to a leaf node is explicitly represented and, typically, large sets of such path instances belong to the same path class, i.e., the nodes of the path instances are labeled by the same sequence of element (or attribute) names. To save storage space and I/O cost, we want to eliminate this structural redundancy to the extent possible. While all known methods for the physical representation (storage) of XML documents proceed from the root via the element/attribute hierarchy (internal nodes) down to the leaves (values), we follow an upside-down approach which explicitly stores the values and reconstructs the internal nodes only if needed. The cornerstones of such a solution are suitable node labels and a path synopsis which efficiently represents all path classes of an XML document. As a solution, we propose a compact internal storage format for native XML database systems in which the inner structure of the stored documents is virtualized. Because this elementless storage format allows a document to be reconstructed efficiently from its path synopsis, all processing properties are preserved and the semantics of navigational and declarative operations of XML languages remain unchanged. Adjusted indexes support the full spectrum of so-called content-and-structure single path queries. Apart from greatly reduced storage consumption, our approach demonstrates its superiority over competing methods not only for a substantial fraction of those queries, but also for storing, reconstructing, and navigating XML documents.
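The upside-down idea can be illustrated with a minimal sketch. It is not the storage format proposed in the article, only a toy model under stated assumptions: node labels are simplified prefix-based (Dewey-style) tuples, the root carries the label (1,), and every stored value is a text node directly below the last element of its path. The names `PathSynopsis`, `store`, `ancestors`, and `inner_nodes` are hypothetical and chosen for illustration only.

```python
from typing import Dict, List, Tuple

# Assumption: a simplified prefix-based (Dewey-style) label; the label of a
# node extends the label of its parent by one component.
Label = Tuple[int, ...]
# A path class: the sequence of element names from the root to a value's parent.
PathClass = Tuple[str, ...]


class PathSynopsis:
    """Hypothetical synopsis: one integer id per distinct path class."""

    def __init__(self) -> None:
        self._ids: Dict[PathClass, int] = {}
        self._paths: List[PathClass] = []

    def class_id(self, path: PathClass) -> int:
        # Register the path class on first sight, then reuse its id.
        if path not in self._ids:
            self._ids[path] = len(self._paths)
            self._paths.append(path)
        return self._ids[path]

    def path(self, class_id: int) -> PathClass:
        return self._paths[class_id]

    def __len__(self) -> int:
        return len(self._paths)


def store(leaves, synopsis):
    """Keep only (label, path class id, value); no element nodes are stored."""
    return [(label, synopsis.class_id(path), value) for label, path, value in leaves]


def ancestors(label: Label, class_id: int, synopsis: PathSynopsis):
    """Rebuild the inner nodes above one value from its label and path class.

    The label prefix of length k belongs to the k-th element name on the path.
    """
    names = synopsis.path(class_id)
    return [(label[: depth + 1], name) for depth, name in enumerate(names)]


def inner_nodes(records, synopsis):
    """Reconstruct all inner nodes of the document, in document order."""
    nodes: Dict[Label, str] = {}
    for label, class_id, _value in records:
        nodes.update(ancestors(label, class_id, synopsis))
    return dict(sorted(nodes.items()))


# Illustrative data: two book elements below a bib root (made-up values).
leaves = [
    ((1, 1, 1, 1), ("bib", "book", "title"), "First title"),
    ((1, 1, 2, 1), ("bib", "book", "year"), "2009"),
    ((1, 2, 1, 1), ("bib", "book", "title"), "Second title"),
]
synopsis = PathSynopsis()
records = store(leaves, synopsis)
structure = inner_nodes(records, synopsis)
```

In this toy model, three stored values share two path classes, and the six element nodes of the document are never stored; they are derived from the labels and the synopsis when needed.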

Author information

Corresponding author

Correspondence to Christian Mathis.

Additional information

Financial support by the Research Center (CM)² of the University of Kaiserslautern is acknowledged (http://cmcm.uni-kl.de).

CR subject classification

E.2, H.2.2, H.2.4

About this article

Cite this article

Mathis, C., Härder, T. & Schmidt, K. Storing and indexing XML documents upside down. Comp. Sci. Res. Dev. 24, 51–68 (2009). https://doi.org/10.1007/s00450-009-0056-x

  • DOI: https://doi.org/10.1007/s00450-009-0056-x
