An exact parallel algorithm to compare very long biological sequences in clusters of workstations

Cluster Computing

Abstract

Biological sequence comparison is one of the most important operations in computational biology, since it is used to determine how similar two sequences are. Smith and Waterman proposed an exact algorithm (SW), based on dynamic programming, that obtains the best local alignment between two sequences in quadratic time and space.
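As a rough illustration of the dynamic programming recurrence behind SW, the following minimal sketch fills the full score matrix and traces back from the highest-scoring cell. It is a textbook version, not the implementation evaluated in this paper; the function name `smith_waterman` and the scoring values (match +1, mismatch -1, linear gap -2) are assumptions chosen only for the example.

```python
def smith_waterman(s, t, match=1, mismatch=-1, gap=-2):
    """Return (best_score, aligned_s, aligned_t) for one best local alignment.

    Scoring values are illustrative assumptions, not taken from the paper.
    """
    n, m = len(s), len(t)
    # Full (n+1) x (m+1) score matrix: this is the quadratic space cost.
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):          # two nested loops: quadratic time
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            H[i][j] = max(0,                    # start a new local alignment
                          H[i - 1][j - 1] + sub,  # align s[i-1] with t[j-1]
                          H[i - 1][j] + gap,      # gap in t
                          H[i][j - 1] + gap)      # gap in s
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the maximum cell until a zero cell is reached.
    i, j = best_pos
    a, b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if s[i - 1] == t[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return best, "".join(reversed(a)), "".join(reversed(b))


if __name__ == "__main__":
    print(smith_waterman("TTACGTT", "GGACGGG"))
```

The matrix `H` holds one cell per pair of positions, which is where both the quadratic time and the quadratic space come from.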

SW is rarely used to compare long biological sequences, since the computation time and the amount of memory required become prohibitive. For this reason, heuristic methods like BLAST are widely used. Although faster, these heuristic methods do not guarantee that the best result will be produced.
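To give a sense of scale for the memory requirement, the back-of-the-envelope sketch below estimates the size of a full SW score matrix for two sequences of 1.6 MBP, the length reported in this abstract. The helper name `sw_matrix_bytes` and the choice of 4 bytes per cell are assumptions made for the example, not figures from the paper.

```python
def sw_matrix_bytes(len_s, len_t, bytes_per_cell=4):
    """Bytes needed to store a full (len_s+1) x (len_t+1) score matrix.

    bytes_per_cell=4 (a 32-bit score) is an assumption for illustration.
    """
    return (len_s + 1) * (len_t + 1) * bytes_per_cell


n = 1_600_000  # 1.6 MBP, the sequence length mentioned in the abstract
full = sw_matrix_bytes(n, n)
print(f"full matrix: about {full / 1e12:.2f} TB")
```

Under these assumptions the full matrix needs roughly 10 TB, which is why a quadratic-space formulation is impractical at this sequence length.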

In this paper, we propose an exact parallel variant of the SW algorithm that obtains the best local alignments in quadratic time and reduced space. The speedups obtained in two clusters (8-machine and 16-machine) for DNA sequences longer than 32 KBP (kilo base-pairs) were very close to linear and, in some cases, superlinear. For very long DNA sequences (1.6 MBP), we were able to reduce execution time from 12.25 hours to 1.54 hours in our 8-machine cluster. As far as we know, this is the first time 1.6 MBP sequences have been compared with an exact SW variant. In this case, 30240 best local alignments were obtained.
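The reported execution times can be turned into a speedup and a parallel efficiency with the usual definitions. The sketch below assumes that 12.25 hours is the single-machine time and that 1.54 hours was measured using all 8 machines; the abstract does not state this explicitly, and the function names are illustrative.

```python
def speedup(t_serial, t_parallel):
    """Classic speedup: serial time divided by parallel time."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, n_machines):
    """Parallel efficiency: speedup divided by the number of machines."""
    return speedup(t_serial, t_parallel) / n_machines


# Times in hours from the abstract; the roles assigned to them are an assumption.
s = speedup(12.25, 1.54)
e = efficiency(12.25, 1.54, 8)
print(f"speedup = {s:.2f}, efficiency = {e:.2%}")
```

Under that reading, the speedup is about 7.95 on 8 machines, an efficiency of roughly 99%, which is consistent with the "very close to linear" description.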

Author information

Corresponding author

Correspondence to Azzedine Boukerche.

About this article

Cite this article

Boukerche, A., de Melo, A.C.M.A., Sandes, E.F.d.O. et al. An exact parallel algorithm to compare very long biological sequences in clusters of workstations. Cluster Comput 10, 187–202 (2007). https://doi.org/10.1007/s10586-007-0020-0

  • DOI: https://doi.org/10.1007/s10586-007-0020-0
