XCQ: A queriable XML compression system

  • Regular Paper
  • Knowledge and Information Systems

Abstract

XML has already become the de facto standard for specifying and exchanging data on the Web. However, XML is by nature verbose, so XML documents are usually large, a factor that hinders their practical use because it substantially increases the costs of storing, processing, and exchanging data. To tackle this problem, many XML-specific compression systems, such as XMill, XGrind, XMLPPM, and Millau, have recently been proposed. However, these systems usually suffer from one of two inadequacies: they either sacrifice performance in terms of compression ratio and execution time in order to support a limited range of queries, or perform full decompression prior to processing queries over compressed documents.

In this paper, we address the above problems by exploiting the information provided by a Document Type Definition (DTD) associated with an XML document. We show that a DTD can both facilitate better compression and generate more usable compressed data to support querying. We present the architecture of XCQ, a compression and querying tool for handling XML data. XCQ is based on a novel technique we have developed called DTD Tree and SAX Event Stream Parsing (DSP). The documents compressed by XCQ are stored in Partitioned Path-Based Grouping (PPG) data streams, which are equipped with a Block Statistics Signature (BSS) indexing scheme. The indexed PPG data streams support the processing of XML queries that involve selection and aggregation, without the need for full decompression. To study the compression performance of XCQ, we carry out comprehensive experiments over a set of XML benchmark datasets.
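
The general idea of querying block-partitioned, path-grouped data without full decompression can be illustrated with a minimal Python sketch. This is not the XCQ implementation: the abstract does not specify the DSP, PPG, or BSS formats, so the sketch assumes that leaf values are grouped by root-to-leaf path, that each group is cut into fixed-size blocks compressed with zlib, and that each block carries a min/max pair as its statistics. All function names, the block layout, and `BLOCK_SIZE` are hypothetical.

```python
import zlib
import xml.etree.ElementTree as ET

BLOCK_SIZE = 2  # hypothetical; kept tiny so the example yields several blocks


def group_by_path(xml_text):
    """Collect numeric leaf values under their root-to-leaf path."""
    groups = {}

    def walk(elem, prefix):
        path = prefix + "/" + elem.tag
        children = list(elem)
        if not children and elem.text and elem.text.strip():
            try:
                groups.setdefault(path, []).append(float(elem.text))
            except ValueError:
                pass  # non-numeric leaves are ignored in this sketch
        for child in children:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return groups


def build_blocks(values):
    """Split one path group into compressed blocks, each with min/max statistics."""
    blocks = []
    for i in range(0, len(values), BLOCK_SIZE):
        chunk = values[i:i + BLOCK_SIZE]
        payload = zlib.compress("\n".join(repr(v) for v in chunk).encode())
        blocks.append({"min": min(chunk), "max": max(chunk), "data": payload})
    return blocks


def select_greater_than(blocks, threshold):
    """Selection: decompress only blocks whose statistics admit a match."""
    hits, decompressed = [], 0
    for block in blocks:
        if block["max"] <= threshold:
            continue  # the statistics rule this block out; it stays compressed
        decompressed += 1
        text = zlib.decompress(block["data"]).decode()
        hits.extend(v for v in map(float, text.split("\n")) if v > threshold)
    return hits, decompressed


def max_value(blocks):
    """Aggregation: answered from the block statistics alone."""
    return max(block["max"] for block in blocks)


prices = [5, 7, 12, 9, 40, 38]
doc = "<orders>" + "".join(
    f"<order><price>{p}</price></order>" for p in prices
) + "</orders>"
blocks = build_blocks(group_by_path(doc)["/orders/order/price"])
hits, touched = select_greater_than(blocks, 30)
print(hits, touched, max_value(blocks))
```

In this toy document the six prices form three blocks, and the selection `price > 30` decompresses only one of them, while the maximum is read from the statistics without decompressing anything. How much work is saved in practice depends on block size and on how values are distributed across blocks.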

Author information

Corresponding author

Correspondence to Wilfred Ng.

Additional information

Wilfred Ng obtained his M.Sc. (Distinction) and Ph.D. degrees from the University of London. His research interests are in the areas of databases and information systems, which include XML data, database query languages, web data management, and data mining. He is now an assistant professor in the Department of Computer Science at the Hong Kong University of Science and Technology (HKUST). Further information can be found at the following URL: http://www.cs.ust.hk/faculty/wilfred/index.html.

Wai-Yeung Lam obtained his M.Phil. degree from the Hong Kong University of Science and Technology (HKUST) in 2003. His research thesis was based on the project “XCQ: A Framework for Querying Compressed XML Data.” He is currently working in industry.

Peter Wood received his Ph.D. in Computer Science from the University of Toronto in 1989. He has previously studied at the University of Cape Town, South Africa, obtaining a B.Sc. degree in 1977 and an M.Sc. degree in Computer Science in 1982. Currently he is a senior lecturer at Birkbeck and a member of the Information Management and Web Technologies research group. His research interests include database and XML query languages, query optimisation, active and deductive rule languages, and graph algorithms.

Mark Levene received his Ph.D. in Computer Science in 1990 from Birkbeck College, University of London, having previously been awarded a B.Sc. in Computer Science from Auckland University, New Zealand in 1982. He is currently professor of Computer Science at Birkbeck College, where he is a member of the Information Management and Web Technologies research group. His main research interests are Web search and navigation, Web data mining and stochastic models for the evolution of the Web. He has published extensively in the areas of database theory and web technologies, and has recently published a book called ‘An Introduction to Search Engines and Web Navigation’.

About this article

Cite this article

Ng, W., Lam, WY., Wood, P.T. et al. XCQ: A queriable XML compression system. Knowl Inf Syst 10, 421–452 (2006). https://doi.org/10.1007/s10115-006-0012-z

  • DOI: https://doi.org/10.1007/s10115-006-0012-z
