Abstract
Byte pair encoding (BPE) is a simple universal text compression scheme. Decompression is very fast and requires little working space. Moreover, it is easy to decompress an arbitrary part of the original text. However, BPE has not been widely used, because compression is rather slow and its compression ratio is worse than that of other methods such as Lempel-Ziv type compression.
In this paper, we bring out a potential advantage of BPE compression. We show that it is well suited, from a practical viewpoint, to compressed pattern matching, where the goal is to find a pattern directly in compressed text without decompressing it explicitly. We compare the running times needed to find a pattern in (1) BPE compressed files, (2) Lempel-Ziv-Welch compressed files, and (3) original text files, in various situations. Experimental results show that pattern matching in BPE compressed text is even faster than matching in the original text. Thus BPE compression reduces not only disk space but also search time.
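To make the scheme concrete, the following is a minimal sketch of byte pair encoding in Python. It is an illustration under stated assumptions, not the implementation evaluated in the paper: new symbols are numbered from 256 upward rather than reusing unused byte values, and the names `bpe_compress`, `expand`, `bpe_decompress`, and `max_rules` are hypothetical.

```python
from collections import Counter


def bpe_compress(data: bytes, max_rules: int = 256):
    """Repeatedly replace the most frequent adjacent pair with a new symbol.

    Returns (tokens, rules), where rules maps each new symbol to the pair
    of symbols it stands for. Assumption: new symbols are integers >= 256.
    """
    seq = list(data)
    rules = {}
    next_sym = 256
    while len(rules) < max_rules:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so a new rule would not help
        rules[next_sym] = pair
        out, i = [], 0
        while i < len(seq):
            # Greedy left-to-right replacement of non-overlapping occurrences.
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(next_sym)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
        next_sym += 1
    return seq, rules


def expand(sym: int, rules: dict) -> bytes:
    """Expand a single symbol back to the bytes it represents."""
    stack, out = [sym], bytearray()
    while stack:
        s = stack.pop()
        if s in rules:
            left, right = rules[s]
            stack.append(right)  # pushed first so that left is expanded first
            stack.append(left)
        else:
            out.append(s)
    return bytes(out)


def bpe_decompress(tokens, rules) -> bytes:
    """Decompression is a table-driven expansion of each token in turn."""
    return b"".join(expand(t, rules) for t in tokens)
```

Because each token expands independently of its neighbours, any slice of the token sequence can be decompressed on its own, which is the property the abstract refers to as decompressing an arbitrary part of the text. The compressor rescans the whole sequence for every rule, which reflects why compression is slow relative to decompression.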
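One way to see why searching BPE compressed text can be fast is the sketch below. It is a simplified illustration and not the authors' algorithm: it builds a Knuth-Morris-Pratt style automaton for the pattern and precomputes, for every token, where each automaton state ends up after the token's expansion and how many matches end inside it. The scan then costs one table lookup per token rather than per byte. The function names and the rule format (a dict mapping a symbol >= 256 to a pair of earlier symbols) are assumptions made for this example.

```python
def kmp_automaton(pattern: bytes):
    """Dense DFA: delta[q][c] is the state reached from q on byte c."""
    m = len(pattern)  # assumption: the pattern is non-empty
    delta = [[0] * 256 for _ in range(m + 1)]
    delta[0][pattern[0]] = 1
    x = 0  # state the automaton falls back to on a mismatch
    for q in range(1, m + 1):
        delta[q] = list(delta[x])          # copy mismatch transitions
        if q < m:
            delta[q][pattern[q]] = q + 1   # match transition
            x = delta[x][pattern[q]]
    return delta


def build_tables(pattern: bytes, rules: dict):
    """Per-token jump and hit tables, indexed by automaton state."""
    delta = kmp_automaton(pattern)
    m = len(pattern)
    states = range(m + 1)
    jump, hits = {}, {}
    for c in range(256):
        jump[c] = [delta[q][c] for q in states]
        hits[c] = [1 if delta[q][c] == m else 0 for q in states]
    # Assumption: each rule only refers to smaller symbols, so visiting
    # the rules in increasing order finds both halves already tabulated.
    for sym in sorted(rules):
        a, b = rules[sym]
        jump[sym] = [jump[b][jump[a][q]] for q in states]
        hits[sym] = [hits[a][q] + hits[b][jump[a][q]] for q in states]
    return jump, hits


def count_in_compressed(tokens, rules, pattern: bytes) -> int:
    """Count (possibly overlapping) occurrences without decompressing."""
    jump, hits = build_tables(pattern, rules)
    q = total = 0
    for t in tokens:
        total += hits[t][q]
        q = jump[t][q]
    return total


# Hand-built example: the tokens below encode the text b"abababab".
example_rules = {256: (97, 98), 257: (256, 256)}
example_tokens = [257, 257]
print(count_in_compressed(example_tokens, example_rules, b"aba"))  # prints 3
```

The table construction costs time proportional to the number of symbols times the pattern length, after which the scan touches each compressed token once. Since the compressed text has fewer tokens than the original has bytes, this gives an intuition for how searching the compressed form could beat searching the original.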
© 2000 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Shibata, Y. et al. (2000). Speeding Up Pattern Matching by Text Compression. In: Bongiovanni, G., Petreschi, R., Gambosi, G. (eds) Algorithms and Complexity. CIAC 2000. Lecture Notes in Computer Science, vol 1767. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-46521-9_25
DOI: https://doi.org/10.1007/3-540-46521-9_25
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-67159-6
Online ISBN: 978-3-540-46521-8
eBook Packages: Springer Book Archive