Abstract
The DNA motif finding problem is of great relevance in molecular biology. The weak signals that mark transcription factor binding sites involved in gene regulation are considered challenging to find. These signals (motifs) consist of a short string of unknown length that can be located anywhere in the gene promoter region. The problem therefore consists of discovering short, conserved sites in genomic DNA without knowing, a priori, either the length or the chemical composition of the site. This turns the original problem into a combinatorial one, to which computational tools can be applied. Pevzner and Sze [7] studied a precise combinatorial formulation of this problem, called the planted motif problem, which is of particular interest because it is a challenging model for commonly used motif-finding algorithms [15]. In this work, we analyze two different encoding schemes for genetic algorithms that solve the planted motif finding problem. One representation encodes the starting position of the motif occurrence in each sequence, and the other encodes a candidate motif. We test the performance of both algorithms on a set of planted motif instances. Preliminary experimental results suggest that the algorithm encoding the candidate motif outperforms the more standard position-based scheme.
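The two encodings can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the function names, the instance generator, and in particular the use of total Hamming distance as the cost to minimize, which is a common scoring choice but not necessarily the fitness function the authors used. No genetic operators are shown; the sketch only shows what an individual looks like under each scheme and how it could be scored.

```python
import random

ALPHABET = "ACGT"

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def planted_instance(t, n, l, d, rng):
    """Hypothetical generator: t random sequences of length n, each holding
    one copy of a random l-mer mutated at exactly d positions."""
    motif = "".join(rng.choice(ALPHABET) for _ in range(l))
    seqs, starts = [], []
    for _ in range(t):
        occurrence = list(motif)
        for p in rng.sample(range(l), d):
            # Substitute a different letter so the position really mutates.
            occurrence[p] = rng.choice([c for c in ALPHABET if c != occurrence[p]])
        start = rng.randrange(n - l + 1)
        background = [rng.choice(ALPHABET) for _ in range(n)]
        background[start:start + l] = occurrence
        seqs.append("".join(background))
        starts.append(start)
    return motif, seqs, starts

def consensus(words):
    """Most frequent letter per column (ties broken in ALPHABET order)."""
    return "".join(max(ALPHABET, key=col.count) for col in zip(*words))

def position_cost(positions, seqs, l):
    """Position-based encoding: an individual is one start index per sequence.
    Assumed cost: total distance of the selected l-mers to their consensus."""
    words = [s[p:p + l] for s, p in zip(seqs, positions)]
    c = consensus(words)
    return sum(hamming(c, w) for w in words)

def motif_cost(candidate, seqs):
    """Motif-based encoding: an individual is a candidate motif string.
    Assumed cost: sum over sequences of the distance to the best-matching l-mer."""
    l = len(candidate)
    return sum(
        min(hamming(candidate, s[i:i + l]) for i in range(len(s) - l + 1))
        for s in seqs
    )

if __name__ == "__main__":
    rng = random.Random(0)
    motif, seqs, starts = planted_instance(t=5, n=50, l=8, d=2, rng=rng)
    print("planted motif:", motif)
    print("cost of true positions:", position_cost(starts, seqs, 8))
    print("cost of true motif:", motif_cost(motif, seqs))
```

Under this assumed scoring, the position-based individual is a vector of t integers and the motif-based individual is a string of length l, so the two schemes search spaces of different size and shape, which is the contrast the comparison rests on.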
References
Bailey, T., Elkan, C.: Unsupervised learning of multiple motifs in biopolymers using expectation maximization. Machine Learning 21, 51–80 (1995)
Blanchette, M., Schwikowski, B., Tompa, M.: Algorithms for phylogenetic footprinting. J. Comp. Biol. 9, 211–223 (2002)
Brazma, A., Jonassen, I., Vilo, J., Ukkonen, E.: Predicting gene regulatory elements in silico on a genomic scale. Genome Res. 8, 1202–1215 (1998)
Buhler, J., Martin, T.: Finding Motifs Using Random Projections. Journal of Computational Biology 9(2), 225–242 (2002)
Che, D., Song, Y., Rasheed, K.: MDGA: Motif Discovery Using A Genetic Algorithm. GECCO’05 (June 25-29, 2005)
Goldberg, D.E.: Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley, Reading (1989)
Gusfield, D.: Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press, Cambridge (1997)
Hertz, G., Stormo, G.: Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics 15, 563–577 (1999)
Jones, N.C., Pevzner, P.A.: Introduction to Bioinformatics Algorithms. MIT Press, Cambridge (2004)
Karaoglu, N., Maurer-Stroh, S., Manderick, B.: GAMOT: An efficient genetic algorithm for finding challenging motifs in DNA sequences. In: Apostolico, A., Guerra, C., Istrail, S., Pevzner, P., Waterman, M. (eds.) RECOMB 2006. LNCS (LNBI), vol. 3909, Springer, Heidelberg (2006)
Lawrence, C., Altschul, S., Boguski, M., Liu, J., Neuwald, A., Wootton, J.: Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science 262, 208–214 (1993)
Liu, F.M., Tsai, J.P., Chen, R.M., Chen, S.N., Shih, S.H.: FMGA: finding motifs by genetic algorithm. In: IEEE Fourth Symposium on Bioinformatics and Bioengineering (BIBE 2004), pp. 459–466. IEEE Computer Society Press, Los Alamitos (2004)
Liu, X., Brutlag, D.L., Liu, J.S.: BioProspector: discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes. Pac. Symp. Biocomput. 6, 127–138 (2001)
Pevzner, P., Sze, S.-H.: Combinatorial approaches to finding subtle signals in DNA sequences. In: Proc. 8th Int. Conf. Intelligent Systems for Molecular Biology, pp. 269–278 (2000)
Price, A., Ramabhadram, S., Pevzner, P.: Finding Subtle Motifs by Branching from Sample Strings. Bioinformatics 1(1), 1–7 (2003)
Roth, F.R., Hughes, J.D., Estep, P.E., Church, G.M.: Finding DNA regulatory motifs within unaligned non-coding sequences clustered by whole-genome mRNA quantitation. Nature Biotechnology 16(10), 939–945 (1998)
Sagot, M.-F.: Spelling approximate repeated or common motifs using a suffix tree. In: Lucchesi, C.L., Moura, A.V. (eds.) LATIN 1998. LNCS, vol. 1380, pp. 111–127. Springer, Heidelberg (1998)
Sagot, M.-F., Escalier, V., Viari, A., Soldano, H.: Searching for repeated words in a text allowing for mismatches and gaps. In: Baeza-Yates, R., Manber, U. (eds.) Second South American Workshop on String Processing, Viña del Mar, Chile, pp. 87–100. University of Chile (1995)
Sinha, S., Tompa, M.: A statistical Method for finding transcription factor binding sites. In: Proc. 8th Int. Conf. Intelligent Systems for Molecular Biology, pp. 344–354 (2000)
Stormo, G.D., Hartzell III, G.W.: Identifying protein-binding sites from unaligned DNA fragments. PNAS 86, 1183–1187 (1989)
Stavrovskaya, E.D., Mironov, A.A.: Two genetic algorithms for identification of regulatory signals. In Silico Biology (2003)
Waterman, M.S., Arratia, R., Galas, D.J.: Pattern recognition in several sequences: consensus and alignment. Bull. Math. Biol. 46, 515–527 (1984)
Zaslavsky, E., Singh, M.: A combinatorial optimization approach for diverse motif finding applications. Algorithms for Molecular Biology, 1–13 (2006)
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Martínez-Arellano, G., Brizuela, C.A. (2007). Comparison of Simple Encoding Schemes in GA’s for the Motif Finding Problem: Preliminary Results. In: Sagot, MF., Walter, M.E.M.T. (eds) Advances in Bioinformatics and Computational Biology. BSB 2007. Lecture Notes in Computer Science, vol 4643. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-73731-5_3
DOI: https://doi.org/10.1007/978-3-540-73731-5_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-73730-8
Online ISBN: 978-3-540-73731-5
eBook Packages: Computer Science, Computer Science (R0)