Skip to main content

Mapping Words into Codewords on PPM

  • Conference paper
Book cover String Processing and Information Retrieval (SPIRE 2006)

Part of the book series: Lecture Notes in Computer Science ((LNTCS,volume 4209))

Included in the following conference series:

Abstract

We describe a simple and efficient scheme which allows words to be managed in PPM modelling when a natural language text file is being compressed. The main idea for managing words is to assign them codes to make them easier to manipulate. A general technique is used to obtain this objective: a dictionary mapping on PPM modelling. In order to test our idea, we are implementing three prototypes: one implements the basic dictionary mapping on PPM, another implements the dictionary mapping with the separate alphabets model and the last one implements the dictionary with the spaceless words model. This technique can be applied directly or it can be combined with some word compression model. The results for files of 1 Mb. and over are better than those achieved by the character PPM which was taken as a base. The comparison between different prototypes shows that the best option is to use a word based PPM in conjunction with the spaceless word concept.

This work was partially supported by the TIC2003-09268 project from MCyT, Spain.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 39.99
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. Adiego, J., de la Fuente, P., Navarro, G.: Merging prediction by partial matching with structural contexts model. In: Proceedings of 14th Data Compression Conference (DCC 2004), p. 522 (2004)

    Google Scholar 

  2. Arnold, R., Bell, T.C.: A corpus for the evaluation of lossless compression algorithms. In: Proceedings of 7th Data Compression Conference (DCC 1997), pp. 201–210 (1997)

    Google Scholar 

  3. Baeza-Yates, R., Ribeiro-Neto, B.: Modern Information Retrieval, Addison-Wesley-Longman (May 1999)

    Google Scholar 

  4. Bell, T.C., Cleary, J.G., Witten, I.H.: Text Compression. Prentice Hall, Englewood Cliffs (1990)

    Google Scholar 

  5. Bell, T.C., Moffat, A., Nevill-Manning, C., Witten, I.H., Zobel, J.: Data compression in full-text retrieval systems. Journal of the American Society for Information Science 44, 508–531 (1993)

    Article  Google Scholar 

  6. Bentley, J., Sleator, D., Tarjan, R., Wei, V.: A locally adaptive data compression scheme. Communications of the ACM 29, 320–330 (1986)

    Article  MATH  MathSciNet  Google Scholar 

  7. Burrows, M., Wheeler, D.: A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation (1994)

    Google Scholar 

  8. Cheney, J.: Compressing XML with multiplexed hierarchical PPM models. In: Proceedings of 11th Data Compression Conference (DCC 2001), pp. 163–172 (2001)

    Google Scholar 

  9. Clearly, J.G., Witten, I.H.: Data compression using adaptive coding and partial string matching. IEEE Transactions on Communications COM-32(4), 396–402 (1984)

    Article  Google Scholar 

  10. Dvorský, J., Pokorný, J., Snášel, V.: Word-based compression methods and indexing for text retrieval systems. In: Eder, J., Rozman, I., Welzer, T. (eds.) ADBIS 1999. LNCS, vol. 1691, pp. 75–84. Springer, Heidelberg (1999)

    Chapter  Google Scholar 

  11. Harman, D.: Overview of the Third Text REtrieval Conference (NIST Special Publication 500-207). In: Proc. Third Text REtrieval Conference (TREC-3), pp. 1–19. NIST Special Publication 207-500 (1995)

    Google Scholar 

  12. Heaps, H.S.: Information Retrieval - Computational and Theoretical Aspects. Academic Press, London (1978)

    MATH  Google Scholar 

  13. Horspool, R.N., Cormack, G.V.: Constructing word-based text compression algorithms. In: Proceedings of 2nd Data Compression Conference (DCC 1992), pp. 62–71 (1992)

    Google Scholar 

  14. Liefke, H., Suciu, D.: XMill: an efficient compressor for XML data. In: Proc. ACM SIGMOD 2000, pp. 153–164 (2000)

    Google Scholar 

  15. Moffat, A.: Word-based text compression. Software - Practice and Experience 19(2), 185–198 (1989)

    Article  Google Scholar 

  16. Moffat, A., Isal, R.Y.K.: Word-based text compression using the Burrows–Wheeler transform. Information Processing & Management 41(5), 1175–1192 (2005)

    Article  MATH  Google Scholar 

  17. Moura, E., Navarro, G., Ziviani, N.: Indexing compressed text. In: Proceedings of the Fourth South American Workshop on String Processing, pp. 95–111 (1997)

    Google Scholar 

  18. Shkarin, D.: PPM: One step to practicality. In: Proceedings of 12th Data Compression Conference (DCC 2002), pp. 202–211 (2002)

    Google Scholar 

  19. Skibinski, P., Grabowski, S., Deorowicz, S.: Revisiting dictionary-based compression. Software–Practice and Experience 35(15), 1455–1476 (2005)

    Article  Google Scholar 

  20. Zipf, G.: Human Behaviour and the Principle of Least Effort. Addison–Wesley (1949)

    Google Scholar 

  21. Ziviani, N., Moura, E., Navarro, G., Baeza-Yates, R.: Compression: A key for next-generation text retrieval systems. IEEE Computer 33(11), 37–44 (2000)

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Adiego, J., de la Fuente, P. (2006). Mapping Words into Codewords on PPM. In: Crestani, F., Ferragina, P., Sanderson, M. (eds) String Processing and Information Retrieval. SPIRE 2006. Lecture Notes in Computer Science, vol 4209. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11880561_15

Download citation

  • DOI: https://doi.org/10.1007/11880561_15

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-45774-9

  • Online ISBN: 978-3-540-45775-6

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics