Using a Solver Over the String Pattern Domain to Analyze Gene Promoter Sequences

Rigotti, Christophe; Mitašiūnaitė, Ieva; Besson, Jérémy; Meyniel, Laurène; Boulicaut, Jean-François; Gandrillon, Olivier

doi:10.1007/978-1-4419-7738-0_17



Abstract

This chapter illustrates how inductive querying techniques can be used to support knowledge discovery from genomic data. More precisely, it presents a data mining scenario to discover putative transcription factor binding sites in gene promoter sequences. We do not provide technical details about the constraint-based data mining algorithms used, which have been described previously. Our contribution is to provide an abstract description of the scenario, its concrete instantiation, and a typical execution on real data. Our main extraction algorithm is a complete solver dedicated to the string pattern domain: it computes string patterns that satisfy a given conjunction of primitive constraints. We also discuss the processing steps necessary to turn it into a useful tool. In particular, we introduce a parameter tuning strategy, an appropriate measure to rank the patterns, and the post-processing approaches that can be and have been applied.
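
To make the idea of "string patterns that satisfy a conjunction of primitive constraints" concrete, the following is a minimal sketch in Python. It is not the chapter's solver, whose details are not given on this page. It assumes, purely for illustration, exact substring matching, a length constraint, a minimum support in a positive sequence set, and a maximum support in a negative set. The names `support`, `candidate_patterns`, and `solve` and all parameter names are hypothetical.

```python
from typing import Iterable, List, Set


def support(pattern: str, sequences: Iterable[str]) -> int:
    """Number of sequences containing the pattern at least once (exact match)."""
    return sum(1 for s in sequences if pattern in s)


def candidate_patterns(sequences: Iterable[str], min_len: int, max_len: int) -> Set[str]:
    """All substrings of the sequences whose length lies in [min_len, max_len]."""
    found: Set[str] = set()
    for s in sequences:
        for length in range(min_len, max_len + 1):
            for start in range(len(s) - length + 1):
                found.add(s[start:start + length])
    return found


def solve(
    positive: List[str],
    negative: List[str],
    min_len: int,
    max_len: int,
    min_pos_support: int,
    max_neg_support: int,
) -> List[str]:
    """Return every pattern satisfying the conjunction of the constraints.

    Naive enumeration (an assumption of this sketch, not the authors' method):
    when min_pos_support >= 1, any solution must occur in a positive sequence,
    so enumerating substrings of the positive set misses no solution.
    """
    candidates = candidate_patterns(positive, min_len, max_len)
    return sorted(
        p
        for p in candidates
        if support(p, positive) >= min_pos_support
        and support(p, negative) <= max_neg_support
    )


# Toy data, invented for illustration only (not real promoter sequences).
positive = ["ACGTGA", "TTCGTG", "CGTGCC"]
negative = ["AAAAAA", "ACGTTT"]
print(solve(positive, negative, min_len=4, max_len=4,
            min_pos_support=3, max_neg_support=0))  # prints ['CGTG']
```

On this toy input the only length-4 pattern present in all three positive sequences and absent from both negative ones is `CGTG`. A practical solver would prune the search space rather than enumerate every substring, and could use similarity-based rather than exact matching.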



Author information

Authors and Affiliations

Laboratoire LIRIS CNRS UMR 5205, INSA-Lyon, 69621, Villeurbanne, France
Christophe Rigotti, Laurène Meyniel & Jean-François Boulicaut
Faculty of Mathematics and Informatics, Vilnius University, Vilnius, Lithuania
Ieva Mitašiūnaitė & Jérémy Besson
Centre de Génétique Moléculaire et Cellulaire CNRS UMR 5534, Université Claude Bernard Lyon I, 69622, Villeurbanne, France
Olivier Gandrillon


Corresponding author

Correspondence to Christophe Rigotti.

Editor information

Editors and Affiliations

Department of Knowledge Technologies, Jožef Stefan Institute, Jamova 39, Ljubljana, 1000, Slovenia
Sašo Džeroski
Mathematics and Computer Science, University of Antwerp, Middelheimlaan 1, Antwerpen, B-2020, Belgium
Bart Goethals
Dept. of Knowledge Technologies, Jožef Stefan Institute, Jamova cesta 39, Ljubljana, SI-1000, Slovenia
Panče Panov


About this chapter

Cite this chapter

Rigotti, C., Mitašiūnaitė, I., Besson, J., Meyniel, L., Boulicaut, JF., Gandrillon, O. (2010). Using a Solver Over the String Pattern Domain to Analyze Gene Promoter Sequences. In: Džeroski, S., Goethals, B., Panov, P. (eds) Inductive Databases and Constraint-Based Data Mining. Springer, New York, NY. https://doi.org/10.1007/978-1-4419-7738-0_17


DOI: https://doi.org/10.1007/978-1-4419-7738-0_17
Published: 18 November 2010
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4419-7737-3
Online ISBN: 978-1-4419-7738-0
eBook Packages: Computer Science, Computer Science (R0)
