Abstract
XBW [10] is a modular program for lossless compression that makes it possible to test various combinations of algorithms. We obtained the best results with an XML parser that creates a dictionary of syllables or words, combined with the Burrows-Wheeler transform, hence the name XBW. The motivation for creating a parser that handles non-valid XML and HTML files was the EGOTHOR system [5] for full-text searching. On files of approximately 20 MB, formed by hundreds of web pages, we achieved twice the compression ratio of bzip2 while running only twice as long. For smaller files, XBW also gives very good results compared with other programs, especially for languages with rich morphology such as Slovak or German. For any large text file, our program offers a good balance between compression ratio and run time.
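As an illustration of the word-based parsing idea, the following minimal sketch splits possibly non-valid markup into tags, words and single separator characters, and replaces each token by an integer symbol from a dictionary. The token pattern and the names `tokenize`, `build_dictionary` and `restore` are assumptions made for this example; they are not taken from the XBW sources, and the syllable-based variant is not shown.

```python
import re

# Hypothetical token pattern (an assumption, not XBW's actual grammar):
# a tag-like run "<...>", a word, any other single character,
# or a lone "<" that never gets closed (non-valid markup).
TOKEN_RE = re.compile(r"<[^<>]*>|\w+|[^\w<]|<")


def tokenize(text):
    """Split text into tokens; concatenating the tokens restores the text."""
    return TOKEN_RE.findall(text)


def build_dictionary(tokens):
    """Assign each distinct token an integer symbol in order of first use."""
    dictionary = {}
    symbols = []
    for tok in tokens:
        if tok not in dictionary:
            dictionary[tok] = len(dictionary)
        symbols.append(dictionary[tok])
    return dictionary, symbols


def restore(dictionary, symbols):
    """Invert build_dictionary: map symbols back to tokens and join them."""
    inverse = {sym: tok for tok, sym in dictionary.items()}
    return "".join(inverse[s] for s in symbols)


if __name__ == "__main__":
    page = "<p>Hello world, hello <b>world</p> a < b"
    dictionary, symbols = build_dictionary(tokenize(page))
    print(len(dictionary), "distinct tokens for", len(symbols), "symbols")
    print(restore(dictionary, symbols) == page)
```

The resulting symbol sequence is what a later stage, such as a block-sorting transform over a large alphabet, would consume; the dictionary itself must also be stored with the compressed data.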
XBW allows the parser and the coder to be used with any of the implemented compression algorithms. We have implemented the Burrows-Wheeler transform, which together with MTF and RLE forms block compression; the dictionary methods LZC and LZSS; and the statistical method PPM. The coder offers a choice of Huffman or arithmetic coding.
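The block compression stage mentioned above can be sketched as follows. This is a minimal illustration, not the XBW implementation: it uses a naive rotation sort instead of a suffix array, it assumes the stage order BWT, then MTF, then RLE over the MTF ranks, and it omits the final Huffman or arithmetic coder. All function names are hypothetical. The input is a sequence of comparable symbols, for example bytes or dictionary indices of words.

```python
def bwt(seq):
    """Naive Burrows-Wheeler transform: sort all rotations, keep the last column."""
    n = len(seq)
    if n == 0:
        return [], 0
    rotations = sorted(range(n), key=lambda i: seq[i:] + seq[:i])
    last = [seq[(i - 1) % n] for i in rotations]
    return last, rotations.index(0)  # index of the row holding the original


def inverse_bwt(last, primary):
    """Rebuild the original sequence from the last column and the primary index."""
    order = sorted(range(len(last)), key=lambda i: last[i])  # stable sort
    out, row = [], primary
    for _ in range(len(last)):
        row = order[row]
        out.append(last[row])
    return out


def mtf_encode(seq, alphabet):
    """Move-to-front: emit each symbol's current rank, then move it to the front."""
    table, out = list(alphabet), []
    for s in seq:
        i = table.index(s)
        out.append(i)
        table.insert(0, table.pop(i))
    return out


def mtf_decode(ranks, alphabet):
    table, out = list(alphabet), []
    for i in ranks:
        s = table.pop(i)
        out.append(s)
        table.insert(0, s)
    return out


def rle_encode(seq):
    """Run-length encoding as a list of (symbol, run length) pairs."""
    runs = []
    for s in seq:
        if runs and runs[-1][0] == s:
            runs[-1][1] += 1
        else:
            runs.append([s, 1])
    return [(s, c) for s, c in runs]


def rle_decode(runs):
    return [s for s, c in runs for _ in range(c)]


def compress_block(seq):
    alphabet = sorted(set(seq))
    last, primary = bwt(seq)
    return alphabet, primary, rle_encode(mtf_encode(last, alphabet))


def decompress_block(alphabet, primary, runs):
    return inverse_bwt(mtf_decode(rle_decode(runs), alphabet), primary)


if __name__ == "__main__":
    data = list(b"banana bandana banana")
    packed = compress_block(data)
    print(packed[2])
    print(decompress_block(*packed) == data)
```

The rotation sort is quadratic in memory and slow for large blocks; it is used here only because it keeps the sketch short.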
This work was supported by the Charles University Grant Agency in the project “Text compression” (GAUK no. 1607, section A) and by the Program “Information Society” under project 1ET100300419.
References
Burrows, M., Wheeler, D.J.: A Block Sorting Lossless Data Compression Algorithm. Technical report, Digital Equipment Corporation, Palo Alto, CA, U.S.A (2003)
Cleary, J.G., Witten, I.H.: Data compression using adaptive coding and partial string matching. IEEE Transactions on Communications COM-32(4), 396–402 (1984)
Cheney, J.: Compressing XML with Multiplexed Hierarchical PPM Models. In: Storer, J.A., Cohn, M. (eds.) Proceedings of 2001 IEEE Data Compression Conference, p. 163. IEEE Computer Society Press, Los Alamitos, California, USA (2001)
Ferragina, P., Luccio, F., Manzini, G., Muthukrishnan, S.: Structuring labeled trees for optimal succinctness, and beyond. In: FOCS 2005. Proc. 46th Annual IEEE Symposium on Foundations of Computer Science, pp. 184–193 (2005)
Galamboš, L.: EGOTHOR, http://www.egothor.org/
Horspool, R.N.: Improving LZW. In: Storer, J.A., Reif, J.H. (eds.) Proceedings of 1991 IEEE Data Compression Conference, pp. 332–341. IEEE Computer Society Press, Los Alamitos, California, USA (1991)
Jones, D.W.: Application of splay trees to data compression. Communications of the ACM 31(8), 996–1007 (1988)
Kärkkäinen, J., Sanders, P.: Simple linear work suffix array construction. In: Baeten, J.C.M., Lenstra, J.K., Parrow, J., Woeginger, G.J. (eds.) ICALP 2003. LNCS, vol. 2719, pp. 943–955. Springer, Heidelberg (2003)
Kao, T.H.: Improving suffix-array construction algorithms with applications. Master Thesis. Gunma University, Japan (2001)
Lánský, J., Šesták, R., Uzel, P., Kovalčin, S., Kumičák, P., Urban, T., Szabó, M.: XBW - Word-based compression of non-valid XML documents, http://xbw.sourceforge.net/
Lánský, J., Žemlička, M.: Compression of a Dictionary. In: Snášel, V., Richta, K., Pokorný, J. (eds.) Proceedings of the Dateso 2006 Annual International Workshop on DAtabases, TExts, Specifications and Objects. CEUR-WS, vol. 176, pp. 11–20 (2006)
Lánský, J., Žemlička, M.: Compression of a Set of Strings. In: Storer, J.A., Marcellin, M.W. (eds.) Proceedings of 2007 IEEE Data Compression Conference, p. 390. IEEE Computer Society Press, Los Alamitos, California, USA (2007)
Liefke, H., Suciu, D.: XMill: an Efficient Compressor for XML Data. In: Proceedings of ACM SIGMOD Conference, pp. 153–164 (2000)
Šesták, R.: Suffix Arrays for Large Alphabet. Master Thesis, Charles University in Prague (2007)
Storer, J., Szymanski, T.G.: Data compression via textual substitution. Journal of the ACM 29, 928–951 (1982)
The Open Group Base: iconv. Specifications Issue 6. IEEE Std 1003.1 (2004), http://www.gnu.org/software/libiconv/
Ziv, J., Lempel, A.: A Universal Algorithm for Sequential Data Compression. IEEE Transactions on Information Theory 23(3), 337–342 (1977)
Ziv, J., Lempel, A.: Compression of Individual Sequences via Variable-Rate Coding. IEEE Transactions on Information Theory 24(5), 530–536 (1978)
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Šesták, R., Lánský, J. (2008). Compression of Concatenated Web Pages Using XBW. In: Geffert, V., Karhumäki, J., Bertoni, A., Preneel, B., Návrat, P., Bieliková, M. (eds) SOFSEM 2008: Theory and Practice of Computer Science. SOFSEM 2008. Lecture Notes in Computer Science, vol 4910. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-77566-9_64
DOI: https://doi.org/10.1007/978-3-540-77566-9_64
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-77565-2
Online ISBN: 978-3-540-77566-9
eBook Packages: Computer Science, Computer Science (R0)