ABSTRACT
XML is fast becoming the standard format for storing, exchanging and publishing data over the web, and is increasingly embedded in applications. Two challenges in handling XML are its size (the XML representation of a document is significantly larger than its native state) and the complexity of its search (XML search involves path and content searches on labeled tree structures). We address the basic problems of compression, navigation and searching of XML documents. In particular, we adopt recently proposed theoretical algorithms [11] for succinct tree representations to design and implement a compressed index for XML, called XBZIPiNDEX, in which the XML document is maintained in a highly compressed format, and both navigation and searching can be done by uncompressing only a tiny fraction of the data. This solution relies on compressing and indexing two arrays derived from the XML data. With detailed experiments we compare it with other compressed XML indexing and searching engines and show that XBZIPiNDEX achieves a compression ratio up to 35% better than those achievable by the other tools, and that its time performance on some path and content search operations is orders of magnitude faster: a few milliseconds over hundreds of MBs of XML files versus tens of seconds, on standard XML data sources.
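The abstract does not spell out which two arrays are derived from the XML data. As a rough illustration only, assuming the construction follows the labeled-tree transform of [11], the sketch below linearizes a small labeled tree into a label array and a "last child" bit array by stably sorting the nodes on their upward label paths. The names `Node`, `xbw_arrays`, `s_last` and `s_alpha` are hypothetical, the sample tree is invented, and the sketch ignores text content, attributes and the compression and indexing steps, so it should not be read as the XBZIPiNDEX implementation.

```python
class Node:
    """A node of a labeled ordered tree (hypothetical helper type)."""

    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []


def xbw_arrays(root):
    """Return (s_last, s_alpha) for the labeled tree rooted at `root`.

    s_alpha[i] is the label of the i-th node in sorted order;
    s_last[i] is 1 if that node is the last child of its parent.
    """
    triples = []  # (upward_path, label, is_last_child), collected in pre-order

    def visit(node, upward_path, is_last):
        triples.append((upward_path, node.label, is_last))
        # A child's upward path starts at its parent and ends at the root.
        child_path = (node.label,) + upward_path
        last_index = len(node.children) - 1
        for i, child in enumerate(node.children):
            visit(child, child_path, i == last_index)

    visit(root, (), True)
    # list.sort is stable, so nodes sharing an upward path keep pre-order.
    triples.sort(key=lambda t: t[0])
    s_last = [int(is_last) for _, _, is_last in triples]
    s_alpha = [label for _, label, _ in triples]
    return s_last, s_alpha


# Invented sample document: <dblp><book><author/><title/></book><article><author/></article></dblp>
tree = Node("dblp", [
    Node("book", [Node("author"), Node("title")]),
    Node("article", [Node("author")]),
])
s_last, s_alpha = xbw_arrays(tree)
```

Sorting by upward path groups together nodes that share the same ancestor labels, which is what makes the label array amenable to compression and lets a path query be mapped to a contiguous range. Materializing every path as a tuple is done here for clarity; it is not how an efficient construction would proceed.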
- http://xml.coverpages.org/xml.html.
- J. Adiego, P. de la Fuente, and G. Navarro. Lempel-Ziv compression of structured text. In IEEE Data Compression Conference, 2004.
- J. Adiego, P. de la Fuente, and G. Navarro. Merging prediction by partial matching with structural contexts model. In IEEE Data Compression Conference, page 522, 2004.
- A. Arion, A. Bonifati, G. Costa, S. D'Aguanno, I. Manolescu, and A. Pugliese. XQueC: pushing queries to compressed XML data. In VLDB, 2003.
- D. Benoit, E. Demaine, I. Munro, R. Raman, V. Raman, and S. Rao. Representing trees of higher degree. Algorithmica, 2005.
- B. Catania, A. Maddalena, and A. Vakali. XML document indexes: a classification. In IEEE Internet Computing, pages 64--71, September-October 2005.
- T. Chen, J. Lu, and T. W. Lin. On boosting holism in XML twig pattern matching using structural indexing techniques. In ACM SIGMOD, pages 455--466, 2005.
- J. Cheney. Compressing XML with multiplexed hierarchical PPM models. In IEEE Data Compression Conference, pages 163--172, 2001.
- J. Cheney. An empirical evaluation of simple DTD-conscious compression techniques. In WebDB, 2005.
- J. Cheng and W. Ng. XQzip: Querying compressed XML using structural indexing. In International Conference on Extending Database Technology, pages 219--236, 2004.
- P. Ferragina, F. Luccio, G. Manzini, and S. Muthukrishnan. Structuring labeled trees for optimal succinctness, and beyond. In IEEE FOCS, pages 184--193, 2005.
- P. Ferragina and G. Manzini. An experimental study of a compressed index. Information Sciences, 135:13--28, 2001.
- P. Ferragina and G. Manzini. Indexing compressed text. Journal of the ACM, 52(4):552--581, 2005.
- R. F. Geary, R. Raman, and V. Raman. Succinct ordinal trees with level-ancestor queries. In ACM-SIAM SODA, 2004.
- D. Geer. Will binary XML speed network traffic? IEEE Computer, pages 16--18, April 2005.
- R. Goldman and J. Widom. Dataguides: enabling query formulation and optimization in semistructured databases. In VLDB, pages 436--445, 1997.
- A. Golinsky, I. Munro, and S. Rao. Rank/Select operations on large alphabets: a tool for text indexing. In ACM-SIAM SODA, 2006.
- R. Kaushik, R. Krishnamurthy, J. Naughton, and R. Ramakrishnan. On the integration of structure indexes and inverted lists. In ACM SIGMOD, pages 779--790, 2004.
- R. Kaushik, R. Krishnamurthy, J. F. Naughton, and R. Ramakrishnan. On the integration of structure indexes and inverted lists. In ACM SIGMOD, 2004.
- W. Y. Lam, W. Ng, P. T. Wood, and M. Levene. XCQ: XML compression and querying system. In WWW, 2003.
- H. Liefke and D. Suciu. XMILL: An efficient compressor for XML data. In ACM SIGMOD, pages 153--164, 2000.
- T. Milo and D. Suciu. Index structures for path expressions. In ICDT, pages 277--295, 1999.
- Jun-Ki Min, Myung-Jae Park, and Chin-Wan Chung. Xpress: A queriable compression for XML data. In ACM SIGMOD, pages 122--133, 2003.
- E. Moura, G. Navarro, N. Ziviani, and R. Baeza-Yates. Fast and flexible word searching on compressed text. ACM Transactions on Information Systems, 18(2):113--139, 2000.
- P. R. Raw and B. Moon. PRIX: Indexing and querying XML using Prüfer sequences. In ICDE, pages 288--300, 2004.
- D. Shkarin. PPM: One step to practicality. In IEEE Data Compression Conference, pages 202--211, 2002.
- P. M. Tolani and J. R. Haritsa. XGRIND: A query-friendly XML compressor. In ICDE, pages 225--234, 2002.
- H. Wang, S. Park, W. Fan, and P. S. Yu. ViST: a dynamic index method for querying XML data by tree structures. In ACM SIGMOD, pages 110--121, 2003.
- W. Wang, H. Wang, H. Lu, H. Jang, X. Lin, and J. Li. Efficient processing of XML path queries using the disk-based F&B index. In VLDB, pages 145--156, 2005.
Index Terms
- Compressing and searching XML data via two zips