Compression of Concatenated Web Pages Using XBW

  • Conference paper
SOFSEM 2008: Theory and Practice of Computer Science (SOFSEM 2008)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 4910)

Abstract

XBW [10] is a modular program for lossless compression that enables testing of various combinations of algorithms. We obtained the best results with an XML parser that builds a dictionary of syllables or words, combined with the Burrows-Wheeler transform, hence the name XBW. The motivation for creating a parser that handles non-valid XML and HTML files was the EGOTHOR system [5] for full-text searching. On files of approximately 20 MB, formed by hundreds of web pages, we achieved twice the compression ratio of bzip2 while running only twice as long. On smaller files, XBW performs very well compared with other programs, especially for languages with rich morphology such as Slovak or German. For any large text file, our program offers a good balance of compression and run time.
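The idea of parsing text into a dictionary of words can be illustrated with a minimal Python sketch. It is not XBW's parser: it ignores XML/HTML structure and syllable splitting entirely, and the function names `word_parse` and `word_unparse` are hypothetical. It only shows how a text becomes a dictionary plus a sequence of dictionary indices, which a later stage such as the Burrows-Wheeler transform could then treat as symbols of a large alphabet.

```python
import re


def word_parse(text):
    """Split text into word / non-word tokens and index them.

    Returns (dictionary, symbols): dictionary is the list of distinct
    tokens in order of first appearance, symbols is the text rewritten
    as indices into that list. Assumption: a "word" is simply a
    maximal run of \\w characters; XBW's real parser is more elaborate.
    """
    # Every character is either \w or \W, so the tokens cover the text.
    tokens = re.findall(r"\w+|\W+", text)
    index = {}
    symbols = []
    for tok in tokens:
        if tok not in index:
            index[tok] = len(index)
        symbols.append(index[tok])
    # dicts preserve insertion order, so this lists tokens by index.
    return list(index), symbols


def word_unparse(dictionary, symbols):
    """Rebuild the original text from the dictionary and symbols."""
    return "".join(dictionary[i] for i in symbols)


if __name__ == "__main__":
    d, s = word_parse("the cat and the hat")
    print(d)  # distinct tokens, words and separators alike
    print(s)  # the text as a sequence of dictionary indices
```

Repeated words collapse to a single dictionary entry, which is one plausible reason a word-based alphabet helps on morphologically rich text; the dictionary itself still has to be stored, a problem that references [11] and [12] address.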

XBW allows the parser and the coder to be used with any of the implemented compression algorithms. We have implemented the Burrows-Wheeler transform, which together with MTF and RLE forms block compression; the dictionary methods LZC and LZSS; and the statistical method PPM. The coder offers a choice of Huffman or arithmetic coding.
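The block-compression chain named above (Burrows-Wheeler transform, then move-to-front, then run-length encoding) can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not XBW's implementation: it works on bytes rather than on a word or syllable alphabet, it sorts whole rotations naively instead of building a suffix array, it stops before the final Huffman or arithmetic coding stage, and all function names are hypothetical.

```python
def bwt(data):
    """Naive BWT: sort all rotations, return (last column, row of original)."""
    n = len(data)
    if n == 0:
        return b"", 0
    order = sorted(range(n), key=lambda i: data[i:] + data[:i])
    last = bytes(data[i - 1] for i in order)  # data[-1] wraps for i == 0
    return last, order.index(0)


def inverse_bwt(last, primary):
    """Invert the BWT by following the stable-sort mapping of the last column."""
    n = len(last)
    nxt = sorted(range(n), key=lambda i: last[i])  # sorted() is stable
    out = bytearray()
    row = primary
    for _ in range(n):
        row = nxt[row]
        out.append(last[row])
    return bytes(out)


def mtf_encode(data):
    """Move-to-front: replace each byte by its current rank in a recency list."""
    table = list(range(256))
    ranks = []
    for b in data:
        r = table.index(b)
        ranks.append(r)
        table.insert(0, table.pop(r))
    return ranks


def mtf_decode(ranks):
    table = list(range(256))
    out = bytearray()
    for r in ranks:
        b = table.pop(r)
        out.append(b)
        table.insert(0, b)
    return bytes(out)


def rle_encode(values):
    """Run-length encoding as a list of (value, run length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [tuple(r) for r in runs]


def rle_decode(runs):
    out = []
    for v, count in runs:
        out.extend([v] * count)
    return out


def compress_block(data):
    last, primary = bwt(data)
    return rle_encode(mtf_encode(last)), primary


def decompress_block(runs, primary):
    return inverse_bwt(mtf_decode(rle_decode(runs)), primary)


if __name__ == "__main__":
    runs, primary = compress_block(b"banana")
    print(runs, primary)
    print(decompress_block(runs, primary))
```

The BWT groups equal symbols together, MTF turns those groups into runs of zeros, and RLE shortens the runs. The rotation sort here takes quadratic time and memory in the worst case; references [8], [9] and [14] concern suffix-array construction, which is the usual way to make this step practical on large blocks.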

This work was supported by the Charles University Grant Agency in the project "Text compression" (GAUK no. 1607, section A) and by the Program "Information Society" under project 1ET100300419.

References

  1. Burrows, M., Wheeler, D.J.: A Block Sorting Lossless Data Compression Algorithm. Technical report, Digital Equipment Corporation, Palo Alto, CA, U.S.A (2003)

  2. Cleary, J.G., Witten, I.H.: Data compression using adaptive coding and partial string matching. IEEE Transactions on Communications COM-32(4), 396–402 (1984)

  3. Cheney, J.: Compressing XML with Multiplexed Hierarchical PPM Models. In: Storer, J.A., Cohn, M. (eds.) Proceedings of 2001 IEEE Data Compression Conference, p. 163. IEEE Computer Society Press, Los Alamitos, California, USA (2001)

  4. Ferragina, P., Luccio, F., Manzini, G., Muthukrishnan, S.: Structuring labeled trees for optimal succinctness, and beyond. In: FOCS 2005. Proc. 46th Annual IEEE Symposium on Foundations of Computer Science, pp. 184–193 (2005)

  5. Galamboš, L.: EGOTHOR, http://www.egothor.org/

  6. Horspool, R.N.: Improving LZW. In: Storer, J.A., Reif, J.H. (eds.) Proceedings of 1991 IEEE Data Compression Conference, pp. 332–341. IEEE Computer Society Press, Los Alamitos, California, USA (1991)

  7. Jones, D.W.: Application of splay trees to data compression. Communications of the ACM 31(8), 996–1007 (1988)

  8. Kärkkäinen, J., Sanders, P.: Simple linear work suffix array construction. In: Baeten, J.C.M., Lenstra, J.K., Parrow, J., Woeginger, G.J. (eds.) ICALP 2003. LNCS, vol. 2719, pp. 943–955. Springer, Heidelberg (2003)

  9. Kao, T.H.: Improving suffix-array construction algorithms with applications. Master Thesis, Gunma University, Japan (2001)

  10. Lánský, J., Šesták, R., Uzel, P., Kovalčin, S., Kumičák, P., Urban, T., Szabó, M.: XBW - Word-based compression of non-valid XML documents, http://xbw.sourceforge.net/

  11. Lánský, J., Žemlička, M.: Compression of a Dictionary. In: Snášel, V., Richta, K., Pokorný, J. (eds.) Proceedings of the Dateso 2006 Annual International Workshop on DAtabases, TExts, Specifications and Objects. CEUR-WS, vol. 176, pp. 11–20 (2006)

  12. Lánský, J., Žemlička, M.: Compression of a Set of Strings. In: Storer, J.A., Marcellin, M.W. (eds.) Proceedings of 2007 IEEE Data Compression Conference, p. 390. IEEE Computer Society Press, Los Alamitos, California, USA (2007)

  13. Liefke, H., Suciu, D.: XMill: an Efficient Compressor for XML Data. In: Proceedings of ACM SIGMOD Conference, pp. 153–164 (2000)

  14. Šesták, R.: Suffix Arrays for Large Alphabet. Master Thesis, Charles University in Prague (2007)

  15. Storer, J., Szymanski, T.G.: Data compression via textual substitution. Journal of the ACM 29, 928–951 (1982)

  16. The Open Group Base: iconv. Specifications Issue 6. IEEE Std 1003.1 (2004), http://www.gnu.org/software/libiconv/

  17. Ziv, J., Lempel, A.: A Universal Algorithm for Sequential Data Compression. IEEE Transactions on Information Theory 23(3), 337–342 (1977)

  18. Ziv, J., Lempel, A.: Compression of Individual Sequences via Variable-Rate Coding. IEEE Transactions on Information Theory 24(5), 530–536 (1978)

Editor information

Viliam Geffert, Juhani Karhumäki, Alberto Bertoni, Bart Preneel, Pavol Návrat, Mária Bieliková

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Šesták, R., Lánský, J. (2008). Compression of Concatenated Web Pages Using XBW. In: Geffert, V., Karhumäki, J., Bertoni, A., Preneel, B., Návrat, P., Bieliková, M. (eds) SOFSEM 2008: Theory and Practice of Computer Science. SOFSEM 2008. Lecture Notes in Computer Science, vol 4910. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-77566-9_64

  • DOI: https://doi.org/10.1007/978-3-540-77566-9_64

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-77565-2

  • Online ISBN: 978-3-540-77566-9

  • eBook Packages: Computer Science, Computer Science (R0)
