Abstract
In the last few years, molecular biology has produced a large amount of data, mainly in the form of sequences, that is, strings over an alphabet of four symbols (DNA/RNA) or twenty symbols (proteins). For computational biologists, the main challenge now is to provide efficient tools for analyzing and comparing these sequences. In this paper, we introduce and briefly discuss some open problems, and present a parallel algorithm that finds repeated substrings in a DNA sequence or common substrings in a set of sequences. Occurrences of a substring may be approximate, that is, they can differ by up to a maximum number of mismatches that depends on the length of the substring itself. The output of the algorithm is sorted according to different statistical measures of significance. The algorithm has been successfully implemented on a cluster of workstations.
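To make the notion of an approximate occurrence concrete, the following is a minimal sketch and not the algorithm of the paper. It is a sequential brute-force scan, whereas the paper describes a parallel algorithm whose details are not given in the abstract. The mismatch rule `max_mismatches` (a fixed fraction of the substring length), the parameter `rate`, and all function names are assumptions introduced only for illustration.

```python
def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def max_mismatches(length: int, rate: float = 0.25) -> int:
    """Hypothetical rule: the mismatch budget grows with substring length."""
    return int(rate * length)


def approximate_repeats(seq: str, length: int, rate: float = 0.25,
                        min_occurrences: int = 2) -> dict:
    """Map each substring of `seq` with the given length to the start
    positions of its approximate occurrences (Hamming distance within
    the length-dependent budget). Brute force, quadratic in len(seq)."""
    budget = max_mismatches(length, rate)
    starts = range(len(seq) - length + 1)
    result = {}
    for i in starts:
        pattern = seq[i:i + length]
        if pattern in result:
            continue  # this pattern was already scanned
        result[pattern] = [
            j for j in starts
            if hamming(pattern, seq[j:j + length]) <= budget
        ]
    # Keep only patterns that repeat often enough.
    return {p: pos for p, pos in result.items() if len(pos) >= min_occurrences}


if __name__ == "__main__":
    repeats = approximate_repeats("ACGTACGAACGT", length=4)
    print(repeats.get("ACGT"))
```

With `length=4` and the assumed rate of 0.25, one mismatch is allowed, so the window `ACGA` counts as an occurrence of `ACGT`. Ranking the reported substrings by statistical significance, as the abstract mentions, is outside the scope of this sketch.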
Copyright information
© 2001 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Mauri, G., Pavesi, G. (2001). Parallel Algorithms for the Analysis of Biological Sequences. In: Malyshkin, V. (eds) Parallel Computing Technologies. PaCT 2001. Lecture Notes in Computer Science, vol 2127. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44743-1_48
DOI: https://doi.org/10.1007/3-540-44743-1_48
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-42522-9
Online ISBN: 978-3-540-44743-6
eBook Packages: Springer Book Archive