Speeding Up Pattern Matching by Text Compression

  • Conference paper
Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 1767))

Abstract

Byte pair encoding (BPE) is a simple universal text compression scheme. Decompression is very fast and requires little working space. Moreover, it is easy to decompress an arbitrary part of the original text. However, BPE has not been widely used, because compression is rather slow and the compression ratio is not as good as that of other methods such as Lempel-Ziv type compression.
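
To make the scheme concrete, the following is a minimal sketch of byte pair encoding in Python. It is an illustration under stated assumptions, not the implementation evaluated in the paper: the function names (`bpe_compress`, `bpe_expand`, `bpe_decompress`) are hypothetical, the compressed text is kept as a list of integer symbols rather than a packed byte file, and pair counting is done naively by rescanning the sequence for every new rule, which also shows why compression is slow.

```python
from collections import Counter


def bpe_compress(data: bytes):
    """Repeatedly replace the most frequent adjacent pair with an unused byte value.

    Returns (seq, rules): seq is the compressed symbol list and rules maps
    each introduced symbol to the pair (left, right) it stands for.
    """
    seq = list(data)
    present = set(data)
    unused = [b for b in range(256) if b not in present]  # free byte values
    rules = {}
    while unused and len(seq) > 1:
        pairs = Counter(zip(seq, seq[1:]))
        pair, count = pairs.most_common(1)[0]
        if count < 2:  # no pair repeats, nothing left to gain
            break
        sym = unused.pop()
        rules[sym] = pair
        out, i = [], 0
        while i < len(seq):  # left-to-right, non-overlapping replacement
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(sym)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules


def bpe_expand(sym: int, rules: dict) -> bytes:
    """Expand one symbol to the original bytes using an explicit stack."""
    stack, out = [sym], []
    while stack:
        s = stack.pop()
        if s in rules:
            left, right = rules[s]
            stack.append(right)  # pushed first so that left is expanded first
            stack.append(left)
        else:
            out.append(s)
    return bytes(out)


def bpe_decompress(seq, rules) -> bytes:
    """Decompress any run of symbols; each symbol expands independently."""
    return b"".join(bpe_expand(s, rules) for s in seq)
```

Because every symbol expands independently of its neighbours, calling `bpe_decompress(seq[i:j], rules)` recovers just the corresponding part of the original text, which is the property the abstract refers to.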

In this paper, we bring out a potential advantage of BPE compression. We show that it is well suited, from a practical viewpoint, to compressed pattern matching, where the goal is to find a pattern directly in compressed text without decompressing it explicitly. We compare running times to find a pattern in (1) BPE compressed files, (2) Lempel-Ziv-Welch compressed files, and (3) original text files, in various situations. Experimental results show that pattern matching in BPE compressed text is even faster than matching in the original text. Thus BPE compression reduces not only disk space but also search time.
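
One way to picture searching without explicit decompression is to run a Knuth-Morris-Pratt automaton over the compressed symbols, precomputing for each symbol how it moves the automaton from each state. The sketch below is only an illustration of that general idea and should not be read as the algorithm of the paper. The function name `find_in_bpe`, the integer-symbol representation, and the `rules` dictionary (symbol to pair) are assumptions made for the example, and the pattern is assumed to be non-empty.

```python
from functools import lru_cache


def kmp_failure(pattern: bytes):
    """fail[j] = length of the longest proper border of pattern[:j]."""
    fail = [0] * (len(pattern) + 1)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = fail[k]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i + 1] = k
    return fail


def find_in_bpe(seq, rules, pattern: bytes):
    """Return start positions (in the original text) of all occurrences.

    seq is a list of integer symbols; rules maps a non-terminal symbol to
    the pair (left, right) it replaces. The text is never expanded in full.
    """
    m = len(pattern)
    fail = kmp_failure(pattern)

    def step(q: int, c: int) -> int:
        """One KMP transition on a single original byte c."""
        if q == m:
            q = fail[m]
        while q and pattern[q] != c:
            q = fail[q]
        if pattern[q] == c:
            q += 1
        return q

    @lru_cache(maxsize=None)
    def length(sym: int) -> int:
        """Number of original bytes that sym expands to."""
        if sym in rules:
            left, right = rules[sym]
            return length(left) + length(right)
        return 1

    @lru_cache(maxsize=None)
    def jump(sym: int, q: int):
        """State after reading sym's expansion from state q, plus the
        offsets inside that expansion at which a match ends."""
        if sym not in rules:
            nq = step(q, sym)
            return nq, ((1,) if nq == m else ())
        left, right = rules[sym]
        q_left, hits_left = jump(left, q)
        q_right, hits_right = jump(right, q_left)
        shifted = tuple(length(left) + h for h in hits_right)
        return q_right, hits_left + shifted

    q, pos, starts = 0, 0, []
    for sym in seq:
        q, hits = jump(sym, q)  # one memoized lookup per compressed symbol
        starts.extend(pos + h - m for h in hits)
        pos += length(sym)
    return starts


# Hypothetical hand-built example: 256 stands for "ab", 257 for "abab",
# so seq encodes the text "abababcab".
rules = {256: (ord("a"), ord("b")), 257: (256, 256)}
seq = [257, 256, ord("c"), 256]
print(find_in_bpe(seq, rules, b"bab"))
```

Once the `jump` table is warm, the scan does one lookup per compressed symbol rather than one transition per original byte, which gives an intuition for why searching a shorter compressed text can take less time than searching the original.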

Copyright information

© 2000 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Shibata, Y. et al. (2000). Speeding Up Pattern Matching by Text Compression. In: Bongiovanni, G., Petreschi, R., Gambosi, G. (eds) Algorithms and Complexity. CIAC 2000. Lecture Notes in Computer Science, vol 1767. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-46521-9_25

  • DOI: https://doi.org/10.1007/3-540-46521-9_25

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-67159-6

  • Online ISBN: 978-3-540-46521-8

  • eBook Packages: Springer Book Archive
