NvPD: novel parallel edit distance algorithm, correctness, and performance evaluation

Cluster Computing

Abstract

Edit distance has applications in many domains, such as bioinformatics, spell checking, plagiarism checking, query optimization, speech recognition, and data mining. Traditionally, edit distance is computed by a sequential dynamic-programming solution, which becomes infeasible for large problems. In this paper, we introduce NvPD, a novel algorithm for parallel edit distance computation that resolves the dependencies in the conventional dynamic-programming solution. We also establish the correctness of the modified dependencies. NvPD offers a balanced workload among processors, reduced synchronization overhead, maximum utilization of resources, and the ability to exploit spatial locality. For input strings of lengths \(m\) and \(n\), it completes in \(\min (m,n)\) steps, compared with \(\max (m,n)\) steps for the diagonal-based approach. Experimental evaluation on a variety of random and real-life data sets, over shared-memory multi-core systems and graphics processing units (GPUs), shows that NvPD outperforms state-of-the-art parallel edit distance algorithms.
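For readers unfamiliar with the baseline, the following minimal Python sketch illustrates the conventional dynamic-programming recurrence and the diagonal-based ordering that the abstract compares against. It is not NvPD itself, whose modified dependencies are not described in this abstract. The function names `edit_distance` and `edit_distance_wavefront` are hypothetical, and unit costs for insertion, deletion, and substitution are an assumption.

```python
def edit_distance(a: str, b: str) -> int:
    """Sequential dynamic-programming edit distance, filled row by row.

    Assumes unit cost for insertion, deletion, and substitution.
    Only two rows of the table are kept in memory.
    """
    prev = list(range(len(b) + 1))  # row 0: distance from "" to b[:j]
    for i, ca in enumerate(a, start=1):
        curr = [i]  # column 0: distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def edit_distance_wavefront(a: str, b: str) -> int:
    """Same recurrence, filled one anti-diagonal (i + j == k) at a time.

    Each cell depends only on cells from the two previous anti-diagonals,
    so the cells visited by the inner loop are mutually independent and
    could be computed in parallel. Here they are computed serially.
    """
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for k in range(2, m + n + 1):
        # All valid (i, j) with i + j == k, 1 <= i <= m, 1 <= j <= n.
        for i in range(max(1, k - n), min(m, k - 1) + 1):
            j = k - i
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,
                          d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)
    return d[m][n]
```

Both functions return the same value for any pair of strings; for example, `edit_distance("kitten", "sitting")` and `edit_distance_wavefront("kitten", "sitting")` both return 3. The wavefront version only changes the order in which cells are visited, which is what exposes the parallelism a diagonal-based approach exploits.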

Author information

Corresponding author

Correspondence to Muhammad Umair Sadiq.

Cite this article

Sadiq, M.U., Yousaf, M.M., Aslam, L. et al. NvPD: novel parallel edit distance algorithm, correctness, and performance evaluation. Cluster Comput 23, 879–894 (2020). https://doi.org/10.1007/s10586-019-02962-w

  • DOI: https://doi.org/10.1007/s10586-019-02962-w
