Skip to main content

Sublinear Algorithms for Approximating String Compressibility

  • Conference paper
Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX 2007, RANDOM 2007)

Abstract

We raise the question of approximating the compressibility of a string with respect to a fixed compression scheme, in sublinear time. We study this question in detail for two popular lossless compression schemes: run-length encoding (RLE) and Lempel-Ziv (LZ), and present sublinear algorithms for approximating compressibility with respect to both schemes. We also give several lower bounds that show that our algorithms for both schemes cannot be improved significantly.

Our investigation of LZ yields results whose interest goes beyond the initial questions we set out to study. In particular, we prove combinatorial structural lemmas that relate the compressibility of a string with respect to Lempel-Ziv to the number of distinct short substrings contained in it. In addition, we show that approximating the compressibility with respect to LZ is related to approximating the support size of a distribution.

A full version of this paper is available [17]. These results appeared previously as part of a technical report [16].

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 84.99
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 109.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. Alon, N., Matias, Y., Szegedy, M.: The space complexity of approximating the frequency moments. J. Comput. Syst. Sci. 58(1), 137–147 (1999)

    Article  MathSciNet  MATH  Google Scholar 

  2. Bar-Yossef, Z., Kumar, R., Sivakumar, D.: Sampling algorithms: lower bounds and applications. In: Proceedings of the thirty-third annual ACM symposium on Theory of computing, pp. 266–275. ACM Press, New York (2001)

    Google Scholar 

  3. Batu, T., Dasgupta, S., Kumar, R., Rubinfeld, R.: Tugkan Batu, Sanjoy Dasgupta, Ravi Kumar, and Ronitt Rubinfeld. SIAM Journal on Computing 35(1), 132–150 (2005)

    Article  MathSciNet  Google Scholar 

  4. Benedetto, D., Caglioti, E., Loreto, V.: Language trees and zipping. Phys. Rev. Lett. 88(4) (2002). (See comment by Khmelev DV, Teahan WJ, Phys Rev Lett. 90(8):089803, (2003) and the reply Phys Rev Lett. 90(8):089804, 2003)

    Google Scholar 

  5. Brautbar, M., Samorodnitsky, A.: Approximating the entropy of large alphabets. In: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms (2007)

    Google Scholar 

  6. Bunge, J.: Bibligraphy on estimating the number of classes in a population, http://www.stat.cornell.edu/~bunge/bibliography.htm

  7. Charikar, M., Chaudhuri, S., Motwani, R., Narasayya, V.R.: Towards estimation error guarantees for distinct values. In: PODS, pp. 268–279. ACM, New York (2000)

    Chapter  Google Scholar 

  8. Cilibrasi, R., Vitányi, P.M.B.: Clustering by compression. IEEE Transactions on Information Theory 51(4), 1523–1545 (2005)

    Article  MathSciNet  MATH  Google Scholar 

  9. Cilibrasi, R., Vitányi, P.M.B.: Similarity of objects and the meaning of words. In: Cai, J.-Y., Cooper, S.B., Li, A. (eds.) TAMC 2006. LNCS, vol. 3959, pp. 21–45. Springer, Heidelberg (2006)

    Chapter  Google Scholar 

  10. Cover, T., Thomas, J.: Elements of Information Theory. Wiley & Sons, Chichester (1991)

    Book  MATH  Google Scholar 

  11. Kukushkina, O.V., Polikarpov, A.A., Khmelev, D.V.: Using literal and grammatical statistics for authorship attribution. Prob. Peredachi Inf. 37(2), 96–98 (2000) [Probl. Inf. Transm. ( Engl. Transl.) 37, 172–184 (2001)]

    Google Scholar 

  12. Lehman, E., Shelat, A.: Approximation algorithms for grammer-based compression. In: Proc. 18th Annual Symp. on Discrete Algorithms, pp. 205–212 (2002)

    Google Scholar 

  13. Li, M., Chen, X., Li, X., Ma, B., Vitányi, P.M.B.: The similarity metric. IEEE Transactions on Information Theory 50(12), 3250–3264 (2004) (Prelim. version in SODA 2003)

    Google Scholar 

  14. Li, M., Vitányi, P.: An Introduction to Kolmogorov Complexity and Its Applications. Springer, Heidelberg (1997)

    Book  MATH  Google Scholar 

  15. Loewenstern, D., Hirsh, H., Noordewier, M., Yianilos, P.: DNA sequence classification using compression-based induction. Technical Report 95-04, Rutgers University, DIMACS (1995)

    Google Scholar 

  16. Raskhodnikova, S., Ron, D., Rubinfeld, R., Shpilka, A., Smith, A.: Sublinear algorithms for approximating string compressibility and the distribution support size. Electronic Colloquium on Computational Complexity, TR05-125 (2005)

    Google Scholar 

  17. Raskhodnikova, S., Ron, D., Rubinfeld, R., Smith, A.: Sublinear algorithms for approximating string compressibility. Full version of this paper, in preparation, Arxiv Report 0706.1084 [cs.DS] (June 2007)

    Google Scholar 

  18. Raskhodnikova, S., Ron, D., Shpilka, A., Smith, A.: On the difficulty of approximating the support size of a distribution (Manuscript 2007)

    Google Scholar 

  19. Ziv, J., Lempel, A.: A universal algorithm for sequential data compression. IEEE Transactions on Information Theory 23, 337–343 (1977)

    Article  MathSciNet  MATH  Google Scholar 

  20. Ziv, J., Lempel, A.: Compression of individual sequences via variable-rate coding. IEEE Transactions on Information Theory 24, 530–536 (1978)

    Article  MathSciNet  MATH  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2007 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Raskhodnikova, S., Ron, D., Rubinfeld, R., Smith, A. (2007). Sublinear Algorithms for Approximating String Compressibility. In: Charikar, M., Jansen, K., Reingold, O., Rolim, J.D.P. (eds) Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques. APPROX RANDOM 2007 2007. Lecture Notes in Computer Science, vol 4627. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-74208-1_44

Download citation

  • DOI: https://doi.org/10.1007/978-3-540-74208-1_44

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-74207-4

  • Online ISBN: 978-3-540-74208-1

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics