Abstract
Approximate string matching (ASM) has applications in many disciplines, ranging from information retrieval to gene matching. The conventional solution to this problem is based on dynamic programming and has quadratic space and time complexity, which makes it impractical for searching queries in huge sequences containing billions of characters. Many studies have therefore proposed heuristic, filtration, and index-based solutions that improve on the space and time requirements of the basic solution. These existing solutions obtain better performance by compromising on the completeness of the search. In this paper, we propose a linear-space algorithm for the approximate string matching problem that retains the time complexity of the conventional solution. The proposed method works in linear space without omitting any region of the given text; hence, it finds all possible matches. The conventional dynamic programming solution is modified so that storing the complete traceback table is avoided by keeping only a running count of each edit operation in memory. To that end, several laws and facts about the classical dynamic programming table are established. We also present a parallel version of the proposed algorithm to improve its running time. The algorithm is evaluated on CUDA-enabled GPUs using DNA sequences of between 250 and 970 Mbp. Experiments on natural language text are also performed to highlight the broader applicability of the proposed algorithm. The results show that the algorithm substantially outperforms state-of-the-art algorithms in performance and scalability.
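The idea of dropping the traceback table and carrying a running count of each edit operation can be illustrated with a short sequential sketch. This is not the authors' algorithm or their CUDA implementation; it is a minimal illustration, assuming the classical Sellers-style recurrence in which a match may start at any text position, and it keeps only two columns of the table. The names `approx_matches` and `Counts`, and the convention that an "insertion" is an extra text character and a "deletion" is a missing pattern character, are assumptions made for this example.

```python
from typing import List, Tuple

# Each cell holds (substitutions, insertions, deletions); its cost is their sum.
# Assumed convention: insertion = extra text character, deletion = missing
# pattern character.
Counts = Tuple[int, int, int]


def approx_matches(pattern: str, text: str, k: int) -> List[Tuple[int, Counts]]:
    """Return (end_index, counts) for every text position at which the
    pattern matches a substring ending there with at most k edits."""
    m = len(pattern)
    # Column for the empty text prefix: i pattern characters are all deleted.
    prev: List[Counts] = [(0, 0, i) for i in range(m + 1)]
    hits: List[Tuple[int, Counts]] = []
    for j, tc in enumerate(text):
        # Row 0 is always free, so a match may start anywhere in the text.
        cur: List[Counts] = [(0, 0, 0)]
        for i in range(1, m + 1):
            s, ins, d = prev[i - 1]
            diag = (s + (pattern[i - 1] != tc), ins, d)  # match or substitution
            s, ins, d = prev[i]
            left = (s, ins + 1, d)                       # extra text character
            s, ins, d = cur[i - 1]
            up = (s, ins, d + 1)                         # missing pattern character
            # Keep the cheapest predecessor together with its running counts.
            cur.append(min(diag, left, up, key=sum))
        if sum(cur[m]) <= k:
            hits.append((j, cur[m]))
        prev = cur  # only two columns are ever alive: O(m) memory
    return hits


if __name__ == "__main__":
    for end, (subs, ins, dels) in approx_matches("abc", "xxabdxx", 1):
        print(end, subs, ins, dels)
```

Every text position is visited, so no region is skipped, and the memory footprint is two columns of length `len(pattern) + 1` regardless of the text length. The counts reported for a match describe one optimal edit script; ties between equally cheap predecessors are broken arbitrarily here.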








Data availability
All source code is publicly available on GitHub: (https://github.com/sadiqumair/Space-efficient-Computation-of-Parallel-Approximate-String-Matching). The existing open-source datasets used in this article are available at the following links: (1) NCBI (ftp.ncbi.nlm.nih.gov/), (2) Smart (https://github.com/smart-tool/smart).
Funding
This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.
Author information
Contributions
MUS and MMY have contributed equally to the manuscript.
Ethics declarations
Ethical approval
This article does not contain any studies with human participants or animals performed by any of the authors.
Competing interests
The authors declare that they have no competing interests as defined by Springer, or other interests that might be perceived to influence the results and/or discussion reported in this paper.
About this article
Cite this article
Sadiq, M.U., Yousaf, M.M. Space-efficient computation of parallel approximate string matching. J Supercomput 79, 9093–9126 (2023). https://doi.org/10.1007/s11227-022-05038-6