Abstract
Biological sequence comparison is one of the most important operations in computational biology, since it determines how similar two sequences are. Smith and Waterman proposed an exact algorithm (SW), based on dynamic programming, that obtains the best local alignment between two sequences in quadratic time and space.
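As an illustration of the quadratic time and space cost, the following is a minimal sketch of the classic SW scoring recurrence with a linear gap penalty. The function name and the scoring parameters (`match`, `mismatch`, `gap`) are assumptions chosen for the example, not values taken from the paper, and this is the textbook sequential algorithm rather than the parallel variant proposed here.

```python
def smith_waterman_score(s, t, match=1, mismatch=-1, gap=-2):
    """Return (best_score, (i, j)) for the best local alignment of s and t.

    Textbook Smith-Waterman with a linear gap penalty. The scoring values
    are illustrative assumptions. The full (len(s)+1) x (len(t)+1) matrix
    is kept, which is what makes time and space quadratic.
    """
    rows, cols = len(s) + 1, len(t) + 1
    # H[i][j] = best score of a local alignment ending at s[i-1], t[j-1].
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            up = H[i - 1][j] + gap      # gap in t
            left = H[i][j - 1] + gap    # gap in s
            # The 0 lets a local alignment restart anywhere.
            H[i][j] = max(0, diag, up, left)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    return best, best_pos


if __name__ == "__main__":
    print(smith_waterman_score("GGACGTCC", "TTACGTAA"))
```

The alignment itself would be recovered by tracing back from the returned position until a zero cell is reached; that traceback is the step that needs the stored matrix.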
SW is rarely used to compare long biological sequences, because the computation time and the amount of memory required become prohibitive. For this reason, heuristic methods such as BLAST are widely used. Although faster, these heuristic methods do not guarantee that the best result will be produced.
In this paper, we propose an exact parallel variant of the SW algorithm that obtains the best local alignments in quadratic time and reduced space. The speedups obtained on two clusters (8-machine and 16-machine) for DNA sequences longer than 32 KBP (kilo base-pairs) were very close to linear and, in some cases, superlinear. For very long DNA sequences (1.6 MBP), we were able to reduce execution time from 12.25 hours to 1.54 hours on our 8-machine cluster. As far as we know, this is the first time that 1.6 MBP sequences have been compared with an exact SW variant. In this case, 30240 best local alignments were obtained.
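The reported times for the 1.6 MBP comparison can be turned into a speedup and a parallel efficiency with a short calculation. The sketch below assumes that 12.25 hours is the single-machine execution time and 1.54 hours is the time on all 8 machines; the abstract does not state this explicitly, and the function names are hypothetical.

```python
def speedup(t_serial, t_parallel):
    """Speedup = serial time divided by parallel time."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, machines):
    """Parallel efficiency = speedup divided by the number of machines."""
    return speedup(t_serial, t_parallel) / machines


if __name__ == "__main__":
    # Times in hours, from the abstract. Treating 12.25 h as the
    # single-machine baseline is an assumption.
    s = speedup(12.25, 1.54)
    e = efficiency(12.25, 1.54, machines=8)
    print(f"speedup: {s:.2f}, efficiency: {e:.1%}")
```

Under that assumption the speedup is about 7.95 on 8 machines, which is consistent with the "very close to linear" description.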
Cite this article
Boukerche, A., de Melo, A.C.M.A., Sandes, E.F.d.O. et al. An exact parallel algorithm to compare very long biological sequences in clusters of workstations. Cluster Comput 10, 187–202 (2007). https://doi.org/10.1007/s10586-007-0020-0