Abstract
We explore the benefits of parallelizing seven state-of-the-art string matching algorithms. Using SIMD and multi-threading techniques, we achieve a performance improvement of up to 43.3\(\times \) over reference implementations and a speedup of up to 16.7\(\times \) over the string matching program grep.
We evaluate our implementations on the smart-corpora and the full human genome data set, and show how performance scales with the number of threads and how it depends on pattern length.
1 Introduction
String matching is a fundamental tool in a wide range of practical software. Molecular biology, data compression and information retrieval all rely on efficient string matching algorithms on challenging amounts of input data. For over 35 years string matching algorithms have been studied extensively. Speed and memory constraints are the crucial attributes of state-of-the-art matching algorithms.
Parallelization has become an essential part of algorithm design. Multi-threading, heterogeneous computing and SIMD (single instruction stream, multiple data stream) instructions are the current tools of the trade. Due to the data-parallel nature of most string matching algorithms, these techniques can be used to achieve significant performance gains.
In this paper, we propose parallelization improvements to existing state-of-the-art string matching algorithms. We explore a chunking approach, partitioning the input data and distributing the workload with a thread pool. We utilize modern SIMD instructions to improve throughput in computationally intensive situations and optimize data structures for parallel access. Our implementations are evaluated on the smart-corpora [11] and the human genome [20] (see Note 1). To demonstrate the effectiveness of our approach, we compare the runtime of our implementations with the sequential reference implementations provided by the smart-corpora as well as with the string matching program grep (see Note 2) of the GNU/Linux operating system.
Pattern length and alphabet size influence the effectiveness of different algorithms; choosing the optimal implementation therefore depends on these two parameters. Our evaluation considers different combinations of pattern length and alphabet size. When both parameters are known at runtime, this information can be used to choose the optimal algorithm.
2 Problem Definition
We define the problem of string matching as the task of finding a pattern P of length \(m = |P|\) in a text T of length \(n = |T|\). Pattern and text are strings over an alphabet \(\varSigma \). The results are the absolute positions of every occurrence of P in T. The input is dynamic, so any preprocessing of pattern or text has to take place at runtime. Only exact matches are returned; approximate matches and regular expression patterns are not considered.
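As a point of reference for this definition, the following is a minimal brute-force sketch in Python. It is not one of the evaluated implementations; the function name `find_all` is a hypothetical name chosen for illustration. It only makes the expected output concrete: every absolute start position, including overlapping occurrences.

```python
def find_all(pattern: str, text: str) -> list[int]:
    """Return the absolute start position of every occurrence of pattern in text."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    # Compare the pattern against every alignment in the text: O(n * m) time.
    return [i for i in range(n - m + 1) if text[i:i + m] == pattern]


if __name__ == "__main__":
    # Overlapping occurrences are reported individually.
    print(find_all("aba", "ababa"))
```

Any of the faster algorithms discussed below must return the same positions as this baseline.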
3 Related Work
The introduction of the Knuth et al. [17] and Boyer and Moore [4] algorithms which are, respectively, the first linear and the first sublinear string matching algorithms, initiated the ongoing search for ever faster matching approaches. Both of these inspired many variations. Prominent examples are Horspool [15] and QuickSearch [25], simplifying variations of Boyer-Moore, which have proven to be efficient in practice. The Rabin and Karp [16] algorithm is an alternative solution to the string matching problem, testing for matches based on hashes computed from the input text and pattern.
In more recent years, many more variations and combinations of the classical matching algorithms have been proposed. Faro and Lecroq [11] report on more than 50 new algorithms that have been published since 2000. One example is the Average Optimal Shift-Or algorithm by Fredriksson and Grabowski [12], an extension of the original Shift-Or [2], which leverages bit-parallelism within pattern and text comparison. The BNDM algorithm by Navarro and Raffinot [23] is based on the same principle, and combines it with suffix automata to find matches by efficiently identifying all subpatterns of a word. Another family of algorithms which relies on finding subpatterns is BOM [1] and its variations (cf. e.g. [10]).
For a more detailed and more complete overview of recent advances in string matching algorithms we direct the interested reader to Faro’s and Lecroq’s review article [11].
Despite the global trend in industry and research to increase performance by parallelizing algorithms, to the best of our knowledge only a few parallel approaches to string matching exist, even though efficient theoretical solutions have been proposed: the optimal parallel algorithm for a CREW-PRAM (concurrent-read, exclusive-write parallel random access machine) runs in \(O(\log ^2 n)\) [13]. For a CRCW-PRAM, even a constant time solution has been proposed [14]. There are, however, no practical implementations available for these theoretical algorithms. Nevertheless, there are several published approaches that in some sense rely on inherently parallel properties of string comparisons, such as by exploiting bit-parallelism [5] in comparing strings (cf. e.g. the Shift-Or algorithm [2] and its derivatives, or the works of Cantone et al. [6] or Peltola and Tarhio [24], among many others). Faro and Külekci, on the other hand, further increase the benefits of these approaches by using modern processors' SIMD extensions [9, 19].
Although there is a surprising lack of approaches leveraging classical threading parallelism, there are some works which explore the benefits provided by the massively parallel computing power of modern GPUs. Kouzinopoulos and Margaritis evaluate the performance of GPU implementations of the classical matching algorithms [18] and report on a possible speedup of more than 10\(\times \). Vasiliadis et al. [26] and Cascarano et al. [7] present solutions for regular expression matching on GPUs, which is a superset of the string matching problem. These approaches create finite state machines from the input patterns and execute them in parallel on partitioned input data. Another problem related to string matching is the approximate string matching problem, which allows for missing some possible matches in exchange for speed. Liu et al. [22] present GPU-based solutions and report on up to 80\(\times \) speedups.
4 Implementation
We implement a general chunking approach for all of our string matching implementations. The initial text T is split into chunks of size \(s = \max (2m, s_a)\), where \(s_a\) is 4 MiB for the SSEF algorithm and 1 MiB for all other algorithms. A thread pool runs string matching tasks on these chunks in parallel. Each task examines an additional overlap of \(m-1\) characters after its chunk to ensure that matches crossing chunk boundaries are found. This also avoids inter-chunk synchronization in the matching algorithm. If the text is not large enough to create at least one chunk per thread, we reduce the chunk size to \(s = n / thread\_count \). To preserve global ordering, the matching results are written to a synchronized set.
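The chunking scheme can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not our C++ implementation: the names `search_chunk` and `parallel_find` are hypothetical, the built-in `str.find` stands in for the actual matching algorithm, and results are collected in chunk order and sorted rather than written to a synchronized set. Because of CPython's global interpreter lock the sketch demonstrates the partitioning and the \(m-1\) overlap, not the speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def search_chunk(text, pattern, start, end):
    """Report all matches that START in [start, end)."""
    m = len(pattern)
    # Read m-1 characters past the chunk so boundary-crossing matches are found.
    stop = min(end + m - 1, len(text))
    hits = []
    pos = text.find(pattern, start, stop)
    while pos != -1:
        hits.append(pos)
        pos = text.find(pattern, pos + 1, stop)
    return hits


def parallel_find(text, pattern, threads=4, base_chunk=1 << 20):
    """Chunked search; base_chunk plays the role of s_a (1 MiB by default)."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    size = max(2 * m, base_chunk)
    if n < size * threads:            # fewer than one chunk per thread
        size = max(1, n // threads)
    bounds = [(s, min(s + size, n)) for s in range(0, n, size)]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        parts = pool.map(lambda b: search_chunk(text, pattern, *b), bounds)
        # Each match starts in exactly one chunk, so no duplicates arise.
        return sorted(p for part in parts for p in part)
```

Since a match is attributed to the chunk in which it starts, the tasks never need to communicate, which mirrors the absence of inter-chunk synchronization described above.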
We employ SSE (streaming SIMD extensions) in the appropriate implementations. We use the SSE instruction set (up to version 4.1), as it is supported by Intel and AMD CPUs. The resulting bit-parallelism is essential for high throughput on modern CPU cores.
Our implementations can be found on our project page (see Note 3). We provide a unified C++ interface for all discussed algorithms.
The following subsections give a brief overview of the implemented algorithms. Of particular interest are our modifications to the SSEF algorithm. For a more detailed discussion we refer to the referenced articles.
4.1 Knuth-Morris-Pratt
The well-known Knuth-Morris-Pratt (KMP) algorithm was first published in 1977 [17]. It uses a preprocessing phase on the pattern to build a partial match table. This table can be used to skip known matching prefixes after a partial match was found. Text characters that have already been matched are therefore never visited again. The preprocessing phase runs in O(m) and the actual matching in O(n), resulting in an asymptotic runtime of \(O(n+m)\).
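The two phases can be sketched in Python as follows. This is a textbook formulation with hypothetical function names (`build_table`, `kmp_search`), given for illustration only; it is not the parallelized implementation evaluated later.

```python
def build_table(pattern):
    """table[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it (the partial match table)."""
    table = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = table[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        table[i] = k
    return table


def kmp_search(pattern, text):
    """Return all start positions of pattern in text in O(n + m) time."""
    if not pattern:
        return []
    table = build_table(pattern)
    hits, k = [], 0            # k = number of pattern characters matched so far
    for i, c in enumerate(text):
        while k > 0 and c != pattern[k]:
            k = table[k - 1]   # fall back to the longest reusable prefix
        if c == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = table[k - 1]   # continue, allowing overlapping matches
    return hits
```

The text index `i` only moves forward, which is why no text character is examined again after it has been matched.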
4.2 Shift-Or
The Shift-Or algorithm proposed by Baeza-Yates and Gonnet in 1992 uses efficient bitwise operations [2]. For each character c in the alphabet \(\varSigma \) an occurrence bit-vector \(o_c\) is calculated in a preprocessing phase.
In the matching phase a result bit-vector r is iteratively and-combined with the occurrence vector of the current character. Vector r is then bit-shifted by one position and incremented by one. A match is found when \(r[m] = 1\). We use a word size of 64 bits for the bit-vectors. The runtime is deterministic and in \(O(n*m)\).
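A minimal Python sketch of this matching phase is given below. It follows the and-combined formulation exactly as described above (the classical Shift-Or presentation uses complemented vectors and a bitwise or instead). The function name, the dictionary-based occurrence table and the explicit 64-bit mask are assumptions made for illustration; bit 0 is taken to be the least significant bit.

```python
WORD_MASK = (1 << 64) - 1   # emulate a 64-bit machine word


def bitparallel_search(pattern, text):
    """Bit i of r is set iff the last i text characters equal pattern[:i]."""
    m = len(pattern)
    if not 1 <= m <= 63:
        raise ValueError("bits 0..m must fit into one 64-bit word")
    # Preprocessing: bit i of occ[c] is set iff pattern[i] == c.
    occ = {}
    for i, c in enumerate(pattern):
        occ[c] = occ.get(c, 0) | (1 << i)
    r, hits = 1, []                      # bit 0: the empty prefix always matches
    for j, c in enumerate(text):
        r &= occ.get(c, 0)               # keep prefixes that c extends
        r = ((r << 1) + 1) & WORD_MASK   # shift, then re-arm the empty prefix
        if r & (1 << m):                 # r[m] = 1: the whole pattern matched
            hits.append(j - m + 1)
    return hits
```

Each text character costs a constant number of word operations as long as the pattern fits into a single word.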
4.3 Hash3
Lecroq’s Hashq algorithm from 2007 [21] is based on hash values for q-grams. The preprocessing phase computes a shift table for each hashed q-gram in the pattern. The search algorithm then hashes sub-strings of length q and skips characters according to the precomputed shift table. Potential matches are checked naively. Choosing \(q = 3\) promises the best results for medium length patterns. Hash3 requires a minimum pattern length of \(m = 3\).
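The following Python sketch illustrates the q-gram shift idea for \(q = 3\). It is a simplified variant written for illustration, not the reference implementation: the hash function, the table size of 256 and the function names are assumptions, and after a verified candidate the window simply advances by one position instead of using a precomputed second shift.

```python
Q = 3
TABLE_SIZE = 256   # assumed table size for this sketch


def hash3(s, i):
    """Hash the 3-gram s[i:i+3] into TABLE_SIZE buckets (illustrative hash)."""
    return ((ord(s[i]) << 2) + (ord(s[i + 1]) << 1) + ord(s[i + 2])) % TABLE_SIZE


def hash3_search(pattern, text):
    m, n = len(pattern), len(text)
    if m < Q:
        raise ValueError("Hash3 requires a pattern length of at least 3")
    # shift[h]: distance from the rightmost 3-gram hashing to h to the pattern end.
    shift = [m - Q + 1] * TABLE_SIZE
    for i in range(m - Q + 1):
        shift[hash3(pattern, i)] = m - Q - i   # later (smaller) values overwrite
    hits = []
    j = m - Q                 # start of the 3-gram at the end of the window
    while j <= n - Q:
        s = shift[hash3(text, j)]
        if s == 0:            # hash equals that of the pattern's last 3-gram
            start = j - (m - Q)
            if text[start:start + m] == pattern:   # naive verification
                hits.append(start)
            j += 1
        else:
            j += s            # skip alignments that cannot match
    return hits
```

Hash collisions only cause additional verifications or smaller shifts; they never cause a match to be missed, because the table keeps the minimum shift per bucket.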
4.4 SSEF
The SSEF algorithm [19] precomputes 65536 filter lists based on the kth bit of each character of the pattern. These filters are then applied efficiently, utilizing SSE instructions, on shifting alignments of pattern and text. SSEF is restricted to patterns of length \(m \ge 32\). The worst case runtime is in \(O(n*m)\). If we consider the probability to filter possible matches, SSEF achieves an average runtime in \(O(n * m/65536)\).
In the original SSEF algorithm the parameter k has to be specified by the user. The smart-corpora implementation chooses a fixed value of \(k=7\). We improve on this by counting, for each bit position, how often that bit is set across the characters of the pattern, and choosing the bit that carries the most information; see Table 1 for an example. Optimally, the kth bit is set 50% of the time.
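A possible way to select such a bit is sketched below in Python. The function name and the tie-breaking rule (the lowest bit position wins) are assumptions for illustration, and bit positions are counted from the least significant bit, which may differ from the numbering used in the original algorithm.

```python
def choose_filter_bit(pattern: bytes) -> int:
    """Pick the bit position whose set-frequency over the pattern is closest to 50%."""
    counts = [0] * 8
    for byte in pattern:
        for k in range(8):
            counts[k] += (byte >> k) & 1   # count how often bit k is set
    half = len(pattern) / 2
    # A bit that is set in about half of the characters splits candidates best.
    return min(range(8), key=lambda k: abs(counts[k] - half))


if __name__ == "__main__":
    # For ASCII DNA letters the high bits are almost constant and carry
    # little information, so a low bit is selected instead.
    print(choose_filter_bit(b"ACGT"))
```

A bit that is always or never set would place every alignment into the same filter list and therefore filter nothing.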
A second optimization concerns the filter list itself. The original algorithm and the smart-corpora implementation use a linked list and perform a separate heap allocation for each individual entry. As the number of entries in this linked list is fixed for a given pattern size, we allocate only a single chunk of memory. This allows us to use simple offsets (instead of pointers) to address the list entries. We also minimize the total memory footprint of the filter list by automatically using the smallest data type that can store the offsets inside the list. This has the fortunate side effect of improved cache locality.
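The layout idea can be illustrated with the following Python sketch. It is not our C++ data structure: the class `FlatBuckets`, the helper `smallest_typecode` and the convention that offset 0 marks the end of a list are hypothetical choices, and the 16-bit keys are supplied by the caller rather than derived from a pattern.

```python
from array import array


def smallest_typecode(max_value):
    """Smallest unsigned array type that can hold values up to max_value."""
    for code in ("B", "H", "I"):
        if max_value < 2 ** (8 * array(code).itemsize):
            return code
    return "Q"


class FlatBuckets:
    """65536 linked lists stored in two flat arrays, addressed by offsets."""

    def __init__(self, keys):
        # keys[i] is the 16-bit bucket of entry i; the entry count is known up front.
        count = len(keys)
        code = smallest_typecode(count)        # offsets are 1..count, 0 = end of list
        self.head = array(code, [0]) * 65536   # first entry of each bucket
        self.next = array(code, [0]) * (count + 1)
        for i in reversed(range(count)):       # reversed so traversal is ascending
            self.next[i + 1] = self.head[keys[i]]
            self.head[keys[i]] = i + 1

    def entries(self, key):
        """Yield the entry indices stored in bucket `key`."""
        off = self.head[key]
        while off:
            yield off - 1
            off = self.next[off]
```

With only a handful of entries the offsets fit into single bytes, so the whole structure occupies two contiguous arrays instead of many separately allocated nodes.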
4.5 Variants of the Backward-Oracle-Matching
Faro and Lecroq presented Extended-Backward-Oracle-Matching (EBOM) and Forward-Simplified-Backward-Nondeterministic-DAWG-Matching (FSBNDM) in 2009 [10]. Both are variants of the Backward-Oracle-Matching algorithm and based on finite automata.
Extended-Backward-Oracle-Matching. The EBOM algorithm extends Backward-Oracle-Matching with a fast-loop. The fast-loop technique iterates a matching heuristic in a non-branching cycle. This is used to quickly locate the last character of the pattern in the currently observed text window. In each iteration two consecutive characters are handled. EBOM requires a preprocessing phase in \(O(|\varSigma |^2)\).
Forward-Simplified-Backward-Nondeterministic-DAWG-Matching. The FSBNDM algorithm uses bit-parallelism to implement a non-deterministic forward automaton on the reversed pattern. The preprocessing phase can be performed in \(O(|\varSigma | + m)\).
4.6 Exact-Packed-String-Matching
Exact-Packed-String-Matching (EPSM) was presented in 2013 by Faro and Külekci [8]. EPSM makes use of bit-parallelism by packing several characters into a bit-word and partitioning text T into chunks \(T_i\). These bit-word sized chunks are compared with a packed pattern bit-word. Shift and bitwise-and operations are used to efficiently compare text chunks with the pattern. Our implementation uses SSE registers as 128 bit words. We limit the usage of EPSM to cases with short patterns (\(m \le 8\)). Under these restrictions EPSM is very fast and runs in O(n). The asymptotic runtime for the general case remains \(O(n*m)\).
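The underlying idea of comparing packed words can be illustrated with a scalar Python sketch. This is explicitly not EPSM itself, which compares whole 128-bit blocks with SSE instructions; the sketch packs up to eight characters into one 64-bit word and slides it over the text one character at a time. The function name is hypothetical.

```python
def packed_search(pattern: bytes, text: bytes) -> list[int]:
    """Compare the pattern against the text as single packed integers."""
    m = len(pattern)
    if not 1 <= m <= 8:
        raise ValueError("sketch limited to patterns that fit into 64 bits")
    mask = (1 << (8 * m)) - 1                 # keeps the last m characters
    packed = int.from_bytes(pattern, "big")   # pattern as one bit-word
    window, hits = 0, []
    for j, byte in enumerate(text):
        # Shift the next character in and drop the oldest one with a bitwise and.
        window = ((window << 8) | byte) & mask
        if j >= m - 1 and window == packed:   # one word comparison per position
            hits.append(j - m + 1)
    return hits
```

Each alignment is checked with a single integer comparison instead of up to m character comparisons, which is the effect that the SSE-based packed comparison achieves for many alignments at once.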
5 Evaluation
In the following section we present the evaluation of the performance of our parallelized string matching algorithms. We show experimental results obtained from two benchmarks using the smart-corpora [11] and the human genome [20]. The human genome benchmark input text is the assembly of the human genome, which is 3.1 GB in size and uses an alphabet of four characters. The smart-corpora benchmark comprises seven input texts from the smart-corpora archive:
- The text of the English King James Bible, containing natural English language with a complete alphabet of 63 characters.
- A set of genome sequences for the E. Coli bacterium. The DNA is encoded over an alphabet of size 4.
- Four protein sequences hi, hs, mj, sc, with an alphabet of 20 characters (19 characters for the hs protein).
- The CIA world fact book. Natural English language with a few special characters. Alphabet size of 94.
For both benchmarks, we generate 10 patterns for every input file of the lengths 2, 4, 8, 16, 32, 64, 128, 256, 512, and 1024. The patterns for an input file are generated by randomly picking sequences of the respective length from the file, thus ensuring that there actually are matches for every file and pattern. The benchmark results shown in the remainder of this chapter are averaged over all 10 patterns for every file and configuration. To assess the benefits of parallelization, all experiments are conducted using 1, 2, 4, and 8 threads.
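The pattern generation can be sketched in Python as follows. The sketch assumes that 10 patterns are drawn for each of the listed lengths and that start positions are chosen uniformly at random; the function name, the fixed seed and the handling of inputs shorter than a pattern length are hypothetical choices for illustration.

```python
import random

LENGTHS = [2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]


def generate_patterns(text, per_length=10, seed=0):
    """Draw patterns from the text itself so that each one occurs at least once."""
    rng = random.Random(seed)         # fixed seed for reproducible benchmarks
    patterns = {}
    for m in LENGTHS:
        if m > len(text):
            continue                  # text too short for this pattern length
        starts = [rng.randrange(len(text) - m + 1) for _ in range(per_length)]
        patterns[m] = [text[s:s + m] for s in starts]
    return patterns
```

Because every pattern is a substring of the input, each benchmark run is guaranteed to report at least one match.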
Additionally, we compare the performance results of our implementations with sequential reference implementations of the respective algorithm provided by the smart-corpora as well as the string matching program grep of the GNU/Linux operating system.
Input files are directly mapped into the application’s memory to reduce I/O latencies. We ensured that input files are completely cached by the operating system. To get comparable results we used an equivalent memory-mapping interface for the smart-corpora algorithms. Memory-mapping is used in grep as well. We invoke grep with the parameters \(\texttt {grep <pattern> <file> -c}\). To benchmark the actual string matching, we use the additional switch -c to suppress output of the individual matches and instead print the count of matching lines. The runtimes of grep and our implementations thus encompass the matching algorithm including all synchronization but minimize file and screen I/O. To run the benchmarks we used temci [3], a benchmarking helper tool, in combination with perf, a tool for profiling with performance counters. All experiments were performed on an Intel Xeon E5 system, with 4 CPU cores (8 hardware threads) at 3.7 GHz.
In the following subsection we discuss an excerpt of our result data.
5.1 Results
Figure 1 shows the average time to match a pattern of length 32 on the genome data set for six algorithms on a logarithmic scale. We can observe linear scalability with increased thread count. Our FSBNDM implementation requires a minimum of two threads due to space limitations exceeded by the genome data set and the EPSM algorithm is not applicable due to \(m > 8\). The content of the patterns has an insignificant impact on performance. The maximum relative standard deviation over the patterns is 3% with a relative range of 9%.
Figure 2 shows the average absolute runtimes of six algorithms on the smart-corpora. The algorithms use up to 8 threads. The pattern length is 32. Both SSEF and Hash3 are consistently fast on all seven texts. The relative performance between the algorithms is surprisingly stable.
In Fig. 3 we see the average performance of seven algorithms over different pattern lengths. We use the bible text and our implementations use up to 8 threads. Several algorithms are restricted to specific pattern sizes. With increasing pattern length, algorithm performance increases as well, with the exception of EPSM, which is optimal for \(m = 2\). The maximum relative standard deviation over the pattern contents is 26%. This increase compared to the genome data set is explained by the relatively small runtimes, which are more strongly affected by measurement fluctuations.
To assess the practicality of our implementations we compare our runtimes against the performance of grep. In Fig. 4 we show the relative speedups over different pattern lengths on the human genome data set on a logarithmic scale. In the case where we are limited to one thread, we can achieve a performance increase for pattern lengths between 4 and 128. However, grep outperforms our implementations for patterns with \(m \le 2\) or \(m \ge 256\). If we utilize eight threads, we can achieve significant speedups of up to 16.7\(\times \) for all patterns with \(m \ge 2\). SSEF, EBOM and Hash3 all perform consistently well on this data set.
Figure 5 shows the speedups of our implementations over the reference implementations found in the smart-corpora. The speedups are displayed on a logarithmic scale. The baseline for each algorithm is the corresponding reference implementation. In contrast to speedups on the human genome data set, only the EPSM, KMP and Shift-Or implementations benefit from an increased thread count on this smaller data set. However, our modifications to the SSEF implementation result in a significant speedup even in the sequential case.
6 Conclusion
We used a chunking approach to parallelize seven state-of-the-art string matching algorithms. We have shown linear scalability on the number of threads for large input data. We observed the influence of pattern size on string matching algorithms. For short patterns EPSM and EBOM are the algorithms of choice, while bigger patterns favor Hash3, SSEF and FSBNDM.
SSEF is consistently fast over different alphabet sizes and the supported pattern lengths. With our modifications to SSEF we achieved a 43\(\times \) speedup over the reference implementation. Compared with grep we achieve significant speedups in all cases where the pattern has two or more characters. On the human genome data set the maximal speedup of SSEF compared to grep is 15\(\times \).
In the future we plan to explore a heterogeneous approach by distributing text chunks on CPUs, GPUs and Intel MICs.
Notes
- 1. Dec. 2013 (GRCh38/hg38) assembly of the human genome (hg38, GRCh38 Genome Reference Consortium Human Reference 38 (GCA_000001405.2)). See http://genome.ucsc.edu/ for details on the data set.
- 2. GNU grep 2.20, Copyright (C) 2014 Free Software Foundation, Inc. http://www.gnu.org/software/grep/.
- 3.
References
1. Allauzen, C., Crochemore, M., Raffinot, M.: Factor oracle: a new structure for pattern matching. In: Pavelka, J., Tel, G., Bartošek, M. (eds.) SOFSEM 1999. LNCS, vol. 1725, pp. 295–310. Springer, Heidelberg (1999). doi:10.1007/3-540-47849-3_18
2. Baeza-Yates, R., Gonnet, G.H.: A new approach to text searching. Commun. ACM 35(10), 74–82 (1992)
3. Bechberger, J.: temci (2016). http://temci.readthedocs.io
4. Boyer, R.S., Moore, J.S.: A fast string searching algorithm. Commun. ACM 20(10), 762–772 (1977)
5. Cantone, D., Faro, S., Giaquinta, E.: Bit-(parallelism)\(^2\): getting to the next level of parallelism. In: Boldi, P., Gargano, L. (eds.) FUN 2010. LNCS, vol. 6099, pp. 166–177. Springer, Heidelberg (2010). doi:10.1007/978-3-642-13122-6_18
6. Cantone, D., Faro, S., Giaquinta, E.: A compact representation of nondeterministic (suffix) automata for the bit-parallel approach. In: Amir, A., Parida, L. (eds.) CPM 2010. LNCS, vol. 6129, pp. 288–298. Springer, Heidelberg (2010). doi:10.1007/978-3-642-13509-5_26
7. Cascarano, N., Rolando, P., Risso, F., Sisto, R.: iNFAnt: NFA pattern matching on GPGPU devices. SIGCOMM Comput. Commun. Rev. 40(5), 20–26 (2010)
8. Faro, S., Külekci, M.O.: Fast packed string matching for short patterns. In: Proceedings of the Meeting on Algorithm Engineering & Experiments. Society for Industrial and Applied Mathematics (2013)
9. Faro, S., Külekci, M.O.: Fast and flexible packed string matching. J. Discret. Algorithms 28, 61–72 (2014)
10. Faro, S., Lecroq, T.: Efficient variants of the backward-oracle-matching algorithm. Int. J. Found. Comput. Sci. 20(6), 967–984 (2009)
11. Faro, S., Lecroq, T.: The exact online string matching problem: a review of the most recent results. ACM Comput. Surv. 45(2), Article no. 13 (2013)
12. Fredriksson, K., Grabowski, S.: Practical and optimal string matching. In: Consens, M., Navarro, G. (eds.) SPIRE 2005. LNCS, vol. 3772, pp. 376–387. Springer, Heidelberg (2005). doi:10.1007/11575832_42
13. Galil, Z.: Optimal parallel algorithms for string matching. Inf. Control 67(1), 144–157 (1985)
14. Galil, Z.: A constant-time optimal parallel string-matching algorithm. J. ACM 42(4), 908–918 (1995)
15. Horspool, R.N.: Practical fast searching in strings. Softw.: Pract. Exp. 10(6), 501–506 (1980)
16. Karp, R.M., Rabin, M.O.: Efficient randomized pattern-matching algorithms. IBM J. Res. Dev. 31(2), 249–260 (1987)
17. Knuth, D.E., Morris Jr., J.H., Pratt, V.R.: Fast pattern matching in strings. SIAM J. Comput. 6(2), 323–350 (1977)
18. Kouzinopoulos, C.S., Margaritis, K.G.: String matching on a multicore GPU using CUDA. In: 13th Panhellenic Conference on Informatics, 2009, PCI 2009 (2009)
19. Külekci, M.O.: Filter based fast matching of long patterns by using SIMD instructions. In: Stringology (2009)
20. Lander, E.S., Linton, L.M., Birren, B., Nusbaum, C., Zody, M.C., Baldwin, J., Devon, K., Dewar, K., Doyle, M., FitzHugh, W., et al.: Initial sequencing and analysis of the human genome. Nature 409(6822), 860–921 (2001)
21. Lecroq, T.: Fast exact string matching algorithms. Inf. Process. Lett. 102(6), 229–235 (2007)
22. Liu, Y., Guo, L., Li, J., Ren, M., Li, K.: Parallel algorithms for approximate string matching with k mismatches on CUDA. In: Parallel and Distributed Processing Symposium Workshops PhD Forum (2012)
23. Navarro, G., Raffinot, M.: Fast and flexible string matching by combining bit-parallelism and suffix automata. ACM J. Exp. Algorithmics 5, Article no. 4 (2000)
24. Peltola, H., Tarhio, J.: Alternative algorithms for bit-parallel string matching. In: Nascimento, M.A., Moura, E.S., Oliveira, A.L. (eds.) SPIRE 2003. LNCS, vol. 2857, pp. 80–93. Springer, Heidelberg (2003). doi:10.1007/978-3-540-39984-1_7
25. Sunday, D.M.: A very fast substring search algorithm. Commun. ACM 33(8), 132–142 (1990)
26. Vasiliadis, G., Polychronakis, M., Ioannidis, S.: Parallelization and characterization of pattern matching using GPUs. In: IEEE International Symposium on Workload Characterization (2011)
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Pfaffe, P., Tillmann, M., Lutteropp, S., Scheirle, B., Zerr, K. (2017). Parallel String Matching. In: Desprez, F., et al. Euro-Par 2016: Parallel Processing Workshops. Euro-Par 2016. Lecture Notes in Computer Science(), vol 10104. Springer, Cham. https://doi.org/10.1007/978-3-319-58943-5_15
DOI: https://doi.org/10.1007/978-3-319-58943-5_15
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-58942-8
Online ISBN: 978-3-319-58943-5
eBook Packages: Computer Science, Computer Science (R0)