Article

Compressing and searching XML data via two zips

Authors:
P. Ferragina, Univ. Pisa
F. Luccio, Univ. Pisa
G. Manzini, Univ. Piemonte Orientale
S. Muthukrishnan, Rutgers Univ.

WWW '06: Proceedings of the 15th international conference on World Wide Web, May 2006, Pages 751–760, https://doi.org/10.1145/1135777.1135891

Published: 23 May 2006

ABSTRACT

XML is fast becoming the standard format for storing, exchanging and publishing data over the web, and is increasingly embedded in applications. Two challenges in handling XML are its size (the XML representation of a document is significantly larger than its native state) and the complexity of its search (XML search involves path and content searches on labeled tree structures). We address the basic problems of compressing, navigating and searching XML documents. In particular, we adopt recently proposed theoretical algorithms [11] for succinct tree representations to design and implement a compressed index for XML, called XBzipIndex, in which the XML document is maintained in a highly compressed format, and both navigation and searching can be done by uncompressing only a tiny fraction of the data. This solution relies on compressing and indexing two arrays derived from the XML data. With detailed experiments we compare it with other compressed XML indexing and searching engines, and show that XBzipIndex achieves a compression ratio up to 35% better than those tools, and that its time performance on some path and content search operations is orders of magnitude faster: a few milliseconds over hundreds of MBs of XML files versus tens of seconds, on standard XML data sources.
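
The abstract does not name the two arrays. A minimal sketch, assuming they correspond to the arrays produced by the tree transform of reference [11] (one array of node labels and one array of "last child" flags, both ordered by the upward label path of each node), might look as follows in Python. The tuple-based tree type, the function name xbw_sketch, and the convention that the root is flagged as a last child are illustrative assumptions, not part of the paper's implementation; the compression and indexing layers built on top of the arrays are omitted.

```python
from typing import List, Tuple

# Hypothetical tree type for illustration: a node is (label, list_of_children).
Node = Tuple[str, list]


def xbw_sketch(root: Node) -> Tuple[List[int], List[str]]:
    """Return (s_last, s_alpha) for a labeled ordered tree.

    s_alpha[i] is a node label; s_last[i] is 1 if that node is the
    last child of its parent. Nodes are ordered by the sequence of
    labels on the path from their parent up to the root.
    """
    triples = []  # (upward_path, last_flag, label), collected in preorder

    def visit(node: Node, upward_path: tuple, is_last: bool) -> None:
        label, children = node
        triples.append((upward_path, 1 if is_last else 0, label))
        # For the children, the path starts at this node and ends at the root.
        child_path = (label,) + upward_path
        for i, child in enumerate(children):
            visit(child, child_path, i == len(children) - 1)

    # Assumption: the root has no siblings, so it is flagged as "last".
    visit(root, (), True)

    # Python's sort is stable, so nodes with equal paths keep preorder order.
    triples.sort(key=lambda t: t[0])

    s_last = [t[1] for t in triples]
    s_alpha = [t[2] for t in triples]
    return s_last, s_alpha


if __name__ == "__main__":
    # A(B(D), C(D, E))
    tree = ("A", [("B", [("D", [])]), ("C", [("D", []), ("E", [])])])
    s_last, s_alpha = xbw_sketch(tree)
    print(s_alpha)  # ['A', 'B', 'C', 'D', 'D', 'E']
    print(s_last)   # [1, 0, 1, 1, 0, 1]
```

In this sketch, nodes that share the same upward path end up contiguous in both arrays, so a path query can in principle be answered by narrowing a range of positions rather than by walking the tree.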

References

[1] http://xml.coverpages.org/xml.html.
[2] J. Adiego, P. de la Fuente, and G. Navarro. Lempel-Ziv compression of structured text. In IEEE Data Compression Conference, 2004.
[3] J. Adiego, P. de la Fuente, and G. Navarro. Merging prediction by partial matching with structural contexts model. In IEEE Data Compression Conference, page 522, 2004.
[4] A. Arion, A. Bonifati, G. Costa, S. D'Aguanno, I. Manolescu, and A. Pugliese. XQueC: pushing queries to compressed XML data. In VLDB, 2003.
[5] D. Benoit, E. Demaine, I. Munro, R. Raman, V. Raman, and S. Rao. Representing trees of higher degree. Algorithmica, 2005.
[6] B. Catania, A. Maddalena, and A. Vakali. XML document indexes: a classification. In IEEE Internet Computing, pages 64–71, September-October 2005.
[7] T. Chen, J. Lu, and T. W. Lin. On boosting holism in XML twig pattern matching using structural indexing techniques. In ACM SIGMOD, pages 455–466, 2005.
[8] J. Cheney. Compressing XML with multiplexed hierarchical PPM models. In IEEE Data Compression Conference, pages 163–172, 2001.
[9] J. Cheney. An empirical evaluation of simple DTD-conscious compression techniques. In WebDB, 2005.
[10] J. Cheng and W. Ng. XQzip: Querying compressed XML using structural indexing. In International Conference on Extending Database Technology, pages 219–236, 2004.
[11] P. Ferragina, F. Luccio, G. Manzini, and S. Muthukrishnan. Structuring labeled trees for optimal succinctness, and beyond. In IEEE FOCS, pages 184–193, 2005.
[12] P. Ferragina and G. Manzini. An experimental study of a compressed index. Information Sciences, 135:13–28, 2001.
[13] P. Ferragina and G. Manzini. Indexing compressed text. Journal of the ACM, 52(4):552–581, 2005.
[14] R. F. Geary, R. Raman, and V. Raman. Succinct ordinal trees with level-ancestor queries. In ACM-SIAM SODA, 2004.
[15] D. Geer. Will binary XML speed network traffic? IEEE Computer, pages 16–18, April 2005.
[16] R. Goldman and J. Widom. Dataguides: enabling query formulation and optimization in semistructured databases. In VLDB, pages 436–445, 1997.
[17] A. Golinsky, I. Munro, and S. Rao. Rank/Select operations on large alphabets: a tool for text indexing. In ACM-SIAM SODA, 2006.
[18] R. Kaushik, R. Krishnamurthy, J. Naughton, and R. Ramakrishnan. On the integration of structure indexes and inverted lists. In ACM SIGMOD, pages 779–790, 2004.
[19] R. Kaushik, R. Krishnamurthy, J. F. Naughton, and R. Ramakrishnan. On the integration of structure indexes and inverted lists. In ACM SIGMOD, 2004.
[20] W. Y. Lam, W. Ng, P. T. Wood, and M. Levene. XCQ: XML compression and querying system. In WWW, 2003.
[21] H. Liefke and D. Suciu. XMILL: An efficient compressor for XML data. In ACM SIGMOD, pages 153–164, 2000.
[22] T. Milo and D. Suciu. Index structures for path expressions. In ICDT, pages 277–295, 1999.
[23] Jun-Ki Min, Myung-Jae Park, and Chin-Wan Chung. Xpress: A queriable compression for XML data. In ACM SIGMOD, pages 122–133, 2003.
[24] E. Moura, G. Navarro, N. Ziviani, and R. Baeza-Yates. Fast and flexible word searching on compressed text. ACM Transactions on Information Systems, 18(2):113–139, 2000.
[25] P. R. Raw and B. Moon. PRIX: Indexing and querying XML using Prüfer sequences. In ICDE, pages 288–300, 2004.
[26] D. Shkarin. PPM: One step to practicality. In IEEE Data Compression Conference, pages 202–211, 2002.
[27] P. M. Tolani and J. R. Haritsa. XGRIND: A query-friendly XML compressor. In ICDE, pages 225–234, 2002.
[28] H. Wang, S. Park, W. Fan, and P. S. Yu. ViST: a dynamic index method for querying XML data by tree structures. In ACM SIGMOD, pages 110–121, 2003.
[29] W. Wang, H. Wang, H. Lu, H. Jang, X. Lin, and J. Li. Efficient processing of XML path queries using the disk-based F&B index. In VLDB, pages 145–156, 2005.

Published in
WWW '06: Proceedings of the 15th international conference on World Wide Web
May 2006
1102 pages
ISBN:1595933239
DOI:10.1145/1135777
General Chairs:
Leslie Carr, University of Southampton
David De Roure, University of Southampton
Arun Iyengar, IBM Research
Program Chairs:
Carole Goble, University of Manchester, UK
Mike Dahlin, University of Texas at Austin
Copyright © 2006 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
XML compression and indexing
labeled trees
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%