Abstract
We report an investigation into how constraint-solving techniques can be used to search for patterns in sequences (strings) of symbols over a finite alphabet. We define a constraint-based structure description language for biosequences and give an algorithm that solves the structure-searching problem as a constraint satisfaction problem (CSP). The methodology can describe two-dimensional structures of biosequences, such as tandem repeats, stem loops, palindromes and pseudo-knots. We also report on an implementation of the language in the constraint logic programming language clp(FD), with test results for a simple search algorithm, and on results from a preliminary C++ implementation that uses consistency-checking techniques from CSP solving.
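To make the idea of structure searching as a CSP concrete, the following is a minimal sketch for one of the structures named above, the stem loop. It is not the language or the algorithm of the paper: the function name `find_stem_loops`, the `StemLoop` record, the restriction to Watson-Crick pairs and the fixed stem length are all assumptions made for illustration. The variables are the start position and the loop length, their domains are integer ranges, and the constraint is that the two arms of the stem pair base by base.

```python
from typing import Iterator, NamedTuple

# Assumption: only Watson-Crick pairs over the RNA alphabet are allowed.
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


class StemLoop(NamedTuple):
    start: int     # index of the first base of the 5' arm
    stem_len: int  # number of paired bases in the stem
    loop_len: int  # number of unpaired bases between the two arms


def find_stem_loops(seq: str, stem_len: int,
                    min_loop: int, max_loop: int) -> Iterator[StemLoop]:
    """Enumerate assignments (start, loop_len) that satisfy the pairing constraint."""
    n = len(seq)
    for start in range(n):                            # domain of the start variable
        for loop_len in range(min_loop, max_loop + 1):  # domain of the loop variable
            end = start + 2 * stem_len + loop_len     # one past the 3' arm
            if end > n:
                break  # a longer loop cannot fit in the sequence either
            # Pairing constraint: base start+k must pair with base end-1-k.
            # all() stops at the first violated pair, which prunes early.
            if all(COMPLEMENT.get(seq[start + k]) == seq[end - 1 - k]
                   for k in range(stem_len)):
                yield StemLoop(start, stem_len, loop_len)


if __name__ == "__main__":
    for hit in find_stem_loops("GGGAAAACCC", stem_len=3, min_loop=3, max_loop=5):
        print(hit)
```

This sketch simply enumerates the domains and tests the constraint, failing early on the first mismatched pair. A real constraint solver would instead propagate the constraints to shrink the domains before and during search.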
Cite this article
Eidhammer, I., Jonassen, I., Grindhaug, S.H. et al. A Constraint Based Structure Description Language for Biosequences. Constraints 6, 173–200 (2001). https://doi.org/10.1023/A:1011481521835