Abstract
We describe a novel compression technique for natural language text collections which takes advantage of the information provided by edges when a graph is used to model the text. This technique is called edge-guided compression. We propose an algorithm that allows the text to be transformed in agreement with the edge-guided technique in conjunction with the spaceless words transformation. The result of these transformations is a PPM-friendly byte-stream that has to be codified with a PPM family encoder. The comparison with state-of-art compressors shows that our proposal is a competitive choice for medium and large natural language text collections.
This work was partially supported by the TIN2006-15071-C03-02 project from MCyT, Spain (first and third authors) and by the VA010B06 project from the C. Educación, JCyL, Spain (first author).
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
References
Adiego, J., de la Fuente, P.: Mapping words into codewords on ppm. In: Crestani, F., Ferragina, P., Sanderson, M. (eds.) SPIRE 2006. LNCS, vol. 4209, pp. 181–192. Springer, Heidelberg (2006)
Adiego, J., de la Fuente, P., Navarro, G.: Merging prediction by partial matching with structural contexts model. In: DCC 2004, p. 522 (2004)
Baeza-Yates, R., Ribeiro-Neto, B.: Modern Information Retrieval. Addison-Wesley-Longman, Reading (1999)
Bell, T.C., Cleary, J.G., Witten, I.H.: Text Compression. Prentice-Hall, Englewood Cliffs, N.J (1990)
Bell, T.C., Moffat, A., Nevill-Manning, C., Witten, I.H., Zobel, J.: Data compression in full-text retrieval systems. Journal of the American Society for Information Science 44, 508–531 (1993)
Burrows, M., Wheeler, D.: A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation (1994)
Cheney, J.: Compressing XML with multiplexed hierarchical PPM models. In: DCC 2001, pp. 163–172 (2001)
Clearly, J.G., Witten, I.H.: Data compression using adaptive coding and partial string matching. IEEE Transactions on Communications COM-32(4), 396–402 (1984)
Harman, D.: Overview of the Third Text REtrieval Conference. In: Proc. Third Text REtrieval Conference (TREC-3), pp. 1–19. NIST Special Publication (1995) NIST Special Publication 500-207.
Heaps, H.S.: Information Retrieval - Computational and Theoretical Aspects. Academic Press, London (1978)
Liefke, H., Suciu, D.: XMill: an efficient compressor for XML data. In: Proc. ACM SIGMOD 2000, pp. 153–164. ACM Press, New York (2000)
Moffat, A.: Word-based text compression. Software - Practice and Experience 19(2), 185–198 (1989)
Moffat, A., Isal, R.Y.K.: Word-based text compression using the Burrows–Wheeler transform. Information Processing & Management 41(5), 1175–1192 (2005)
Moura, E., Navarro, G., Ziviani, N.: Indexing compressed text. In: Proceedings of the Fourth South American Workshop on String Processing, pp. 95–111 (1997)
Shkarin, D.: PPM: One step to practicality. In: DCC 2002, pp. 202–211 (2002)
Zipf, G.: Human Behaviour and the Principle of Least Effort. Addison–Wesley, Reading (1949)
Ziviani, N., Moura, E., Navarro, G., Baeza-Yates, R.: Compression: A key for next-generation text retrieval systems. IEEE Computer 33(11), 37–44 (2000)
Author information
Authors and Affiliations
Editor information
Rights and permissions
Copyright information
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Adiego, J., Martínez-Prieto, M.A., de la Fuente, P. (2007). Edge-Guided Natural Language Text Compression. In: Ziviani, N., Baeza-Yates, R. (eds) String Processing and Information Retrieval. SPIRE 2007. Lecture Notes in Computer Science, vol 4726. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-75530-2_2
Download citation
DOI: https://doi.org/10.1007/978-3-540-75530-2_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-75529-6
Online ISBN: 978-3-540-75530-2
eBook Packages: Computer ScienceComputer Science (R0)