Abstract
This paper proposes new algorithms for fixed-length approximate string matching and approximate circular string matching under the Hamming distance. Both problems are special cases of approximate string matching and have numerous direct applications in bioinformatics and text searching. First, a counter-vector-mismatches (CVM) algorithm is proposed to solve fixed-length approximate string matching with k mismatches. The CVM algorithm is based on the parallel summation of counters located in the same machine word. Second, a parallel counter-vector-mismatches (PCVM) algorithm is proposed to accelerate the CVM algorithm. The PCVM algorithm uses two levels of parallelism: it exploits not only word-level parallelism but also data parallelism in parallel environments such as multi-core processors and graphics processing units (GPUs). On GPUs in particular, a shared-memory parallel counter-vector-mismatches (PCVMsmem) scheme can be derived from the PCVM algorithm; it exploits the GPU memory model to optimize the performance of PCVM. Finally, this paper presents several methods for adapting the CVM and PCVM algorithms to the case where the input pattern has a circular structure. In experiments with real DNA packages, the proposed algorithms and scheme run considerably faster than the previous bit-vector-mismatches and parallel bit-vector-mismatches algorithms.
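The idea of summing several counters that share one machine word can be illustrated with a minimal sketch. This is not the authors' CVM algorithm, whose details are not given in the abstract. It is a classic Shift-Add style counter update, assuming one mismatch counter per pattern position packed into a single Python integer that stands in for the machine word. The function names `k_mismatch_positions` and `circular_k_mismatch_positions` are hypothetical, and the circular variant simply tries every rotation of the pattern, which is the naive baseline rather than one of the paper's methods.

```python
def k_mismatch_positions(text, pattern, k):
    """Return start positions where pattern matches text with <= k mismatches.

    Illustrative Shift-Add style sketch (not the paper's CVM): counter j,
    packed into one integer, holds the Hamming distance between
    pattern[0..j] and the text window ending at the current character.
    """
    m = len(pattern)
    if m == 0 or m > len(text):
        return []
    b = m.bit_length()                 # bits per counter, enough for values 0..m
    field_mask = (1 << b) - 1          # extracts a single counter
    state_mask = (1 << (b * m)) - 1    # keeps exactly m counters

    # table[c] has a 1 in counter j when pattern[j] != c.
    all_ones = sum(1 << (b * j) for j in range(m))
    table = {}
    for c in set(text) | set(pattern):
        word = all_ones
        for j, p in enumerate(pattern):
            if p == c:
                word ^= 1 << (b * j)   # clear the 1: position j matches c
        table[c] = word

    state = 0
    hits = []
    for i, c in enumerate(text):
        # One shift and one addition update all m counters at once.
        state = ((state << b) + table[c]) & state_mask
        if i >= m - 1:
            mismatches = (state >> (b * (m - 1))) & field_mask
            if mismatches <= k:
                hits.append(i - m + 1)
    return hits


def circular_k_mismatch_positions(text, pattern, k):
    """Naive circular variant: search every rotation of the pattern."""
    hits = set()
    for r in range(len(pattern)):
        rotation = pattern[r:] + pattern[:r]
        hits.update(k_mismatch_positions(text, rotation, k))
    return sorted(hits)


if __name__ == "__main__":
    print(k_mismatch_positions("ACGTACGT", "ACGA", 1))
    print(circular_k_mismatch_positions("GTACGTAC", "ACGT", 0))
```

Each counter is given enough bits to hold the full pattern length, so no carry can spill into a neighbouring counter. A real word-level implementation would have to fit the counters into 32 or 64 bits and handle overflow explicitly.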
Change history
20 March 2018
The funding information is missing in the Acknowledgements section of the original article. The correct wording is given below.
Acknowledgements
We would like to thank Mr. Ji-Won Song, an M.S. candidate at the School of EEE at Dankook University, who helped us set up the Linux-based GPU experimental environment.
Cite this article
Ho, T., Oh, SR. & Kim, H. New algorithms for fixed-length approximate string matching and approximate circular string matching under the Hamming distance. J Supercomput 74, 1815–1834 (2018). https://doi.org/10.1007/s11227-017-2192-6
DOI: https://doi.org/10.1007/s11227-017-2192-6