Abstract
Practical bit-parallel algorithms exist for several types of pair-wise string processing, such as longest common subsequence computation and approximate string matching. These algorithms typically use a size-σ table of match bit-vectors, where σ is the alphabet size and the bits in the vector for a character λ identify the positions where λ occurs in one of the processed strings. The time and space cost of computing the match table is not prohibitive for reasonably small alphabets such as ASCII text. For general Unicode text, however, the possible numerical code range of the characters is roughly one million, which makes a simple table impractical. In this paper we evaluate three schemes for overcoming this problem. First, we propose replacing the character code table with a character code automaton. We then compare this method with two other schemes: a hash table, and the binary-search based solution proposed by Wu, Manber and Myers [25]. We find that the best choice is either the automaton-based method or a hash table.
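As a rough illustration of the match table described above, the following Python sketch builds match bit-vectors for one string and looks them up in two of the ways the abstract mentions: through a hash table and through binary search over sorted character codes. It is a minimal sketch and not the paper's implementation: the function names are hypothetical, Python's `dict` stands in for the hash table, and the automaton-based scheme is omitted because the abstract gives no details of its construction.

```python
from bisect import bisect_left


def build_match_table(pattern):
    """Hash-table scheme: map each character that occurs in `pattern`
    to its match bit-vector. Bit i is set iff pattern[i] is that character."""
    table = {}
    for i, ch in enumerate(pattern):
        table[ch] = table.get(ch, 0) | (1 << i)
    return table


def build_sorted_match_table(pattern):
    """Binary-search scheme: keep the distinct character codes in sorted
    order, with the match bit-vectors in a parallel list."""
    table = build_match_table(pattern)
    codes = sorted(ord(ch) for ch in table)
    vectors = [table[chr(code)] for code in codes]
    return codes, vectors


def lookup_sorted(codes, vectors, ch):
    """Return the match bit-vector of `ch`, or 0 if `ch` is not in the pattern."""
    code = ord(ch)
    k = bisect_left(codes, code)
    if k < len(codes) and codes[k] == code:
        return vectors[k]
    return 0


# Usage: characters absent from the pattern get an all-zero vector.
hashed = build_match_table("日本日")
codes, vectors = build_sorted_match_table("日本日")
vector_hash = hashed.get("日", 0)
vector_sorted = lookup_sorted(codes, vectors, "日")
```

Both structures store only the characters that actually occur in the string, so their size depends on the string length rather than on the roughly one-million-value Unicode code range.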
References
Aho, A., Corasick, M.: Efficient string matching: an aid to bibliographic search. Communications of the ACM 18(6), 333–340 (1975)
Allison, A., Dix, T.L.: A bit-string longest common subsequence algorithm. Information Processing Letters 23, 305–310 (1986)
Baeza-Yates, R., Gonnet, G.: A new approach to text searching. Communications of the ACM 35(10), 74–82 (1992)
Boyer, R.S., Moore, J.S.: A fast string searching algorithm. Communications of the ACM 20(10), 762–772 (1977)
Crochemore, M., Iliopoulos, C.S., Pinzon, Y.J., Reid, J.F.: A fast and practical bit-vector algorithm for the longest common subsequence problem. Information Processing Letters 80, 279–285 (2001)
Crochemore, M., Rytter, W.: Text Algorithms. Oxford University Press, Oxford (1994)
Czumaj, A., Crochemore, M., Gasieniec, L., Jarominek, S., Lecroq, T., Plandowski, W., Rytter, W.: Speeding up two string-matching algorithms. Algorithmica 12, 247–267 (1994)
Damerau, F.: A technique for computer detection and correction of spelling errors. Communications of the ACM 7(3), 171–176 (1964)
Hyyrö, H.: Explaining and extending the bit-parallel approximate string matching algorithm of Myers. Technical Report A-2001-10, Dept. of Computer and Information Sciences, University of Tampere, Tampere, Finland (2001)
Hyyrö, H.: Bit-parallel approximate string matching with transposition. In: Nascimento, M.A., de Moura, E.S., Oliveira, A.L. (eds.) SPIRE 2003. LNCS, vol. 2857, pp. 95–107. Springer, Heidelberg (2003)
Hyyrö, H.: Bit-parallel LCS-length computation revisited. In: Proc. 15th Australasian Workshop on Combinatorial Algorithms, AWOCA 2004 (2004)
Hyyrö, H., Navarro, G.: Faster bit-parallel approximate string matching. In: Apostolico, A., Takeda, M. (eds.) CPM 2002. LNCS, vol. 2373, p. 203. Springer, Heidelberg (2002)
Knuth, D.E., Morris Jr., J.H., Pratt, V.R.: Fast pattern matching in strings. SIAM Journal on Computing 6(1), 323–350 (1977)
Levenshtein, V.: Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady 10(8), 707–710 (1966)
Myers, G.: A fast bit-vector algorithm for approximate string matching based on dynamic programming. Journal of the ACM 46(3), 395–415 (1999)
Navarro, G.: A guided tour to approximate string matching. ACM Computing Surveys 33(1), 31–88 (2001)
Navarro, G.: NR-grep: a fast and flexible pattern matching tool. Software Practice and Experience 31, 1265–1312 (2001)
Navarro, G., Raffinot, M.: Fast and flexible string matching by combining bit-parallelism and suffix automata. ACM Journal of Experimental Algorithms 5(4) (2000)
Navarro, G., Raffinot, M.: Flexible Pattern Matching in Strings – Practical on-line search algorithms for texts and biological sequences. Cambridge University Press, Cambridge (2002)
Robertson, A.M., Willett, P.: A comparison of spelling-correction methods for the identification of word forms in historical text databases. Literary and Linguistic Computing 8(3), 143–152 (1993)
Sankoff, D., Kruskal, J. (eds.): Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. Addison-Wesley, Reading (1983)
Takeda, M., Miyamoto, S., Kida, T., Shinohara, A., Fukumachi, S., Shinohara, T., Arikawa, S.: Processing text files as is: Pattern matching over compressed texts, multi-byte character texts, and semi-structured texts. In: Laender, A.H.F., Oliveira, A.L. (eds.) SPIRE 2002. LNCS, vol. 2476, p. 170. Springer, Heidelberg (2002)
Unicode Consortium: Unicode Home Page, http://www.unicode.org/
Unicode Consortium: The Unicode Standard 4.0. Addison-Wesley (2003)
Wu, S., Manber, U., Myers, E.: A sub-quadratic algorithm for approximate limited expression matching. Algorithmica 15(1), 50–67 (1996)
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Hyyrö, H., Takaba, J., Shinohara, A., Takeda, M. (2005). On Bit-Parallel Processing of Multi-byte Text. In: Myaeng, S.H., Zhou, M., Wong, KF., Zhang, HJ. (eds) Information Retrieval Technology. AIRS 2004. Lecture Notes in Computer Science, vol 3411. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-31871-2_25
DOI: https://doi.org/10.1007/978-3-540-31871-2_25
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-25065-4
Online ISBN: 978-3-540-31871-2
eBook Packages: Computer Science, Computer Science (R0)