Self-Organizing Maps of Position Weight Matrices for Motif Discovery in Biological Sequences

Artificial Intelligence Review

Abstract

The identification of overrepresented motifs in a collection of biological sequences continues to be a relevant and challenging problem in computational biology. Currently popular methods of motif discovery are based on statistical learning theory. In this paper, a machine-learning approach to the motif discovery problem is explored. The approach is based on a Self-Organizing Map (SOM) in which the weight vectors of the output-layer neurons are replaced by position weight matrices. This approach can characterise features present in a set of sequences and can therefore aid the discovery of overrepresented motifs. The SOM approach to motif discovery is demonstrated using both real and simulated biological sequence datasets.
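
The idea of a SOM whose nodes hold position weight matrices can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the training procedure, so the batch-style update, the Gaussian neighbourhood, the linear radius schedule, the pseudocount, and all function names (`make_pwm`, `log_score`, `train_som`, `consensus`) are assumptions made for illustration. Each node stores a PWM over the DNA alphabet, each fixed-length subsequence (k-mer) is assigned to the node whose PWM gives it the highest log-likelihood, and each PWM is then re-estimated from neighbourhood-weighted base counts.

```python
import math
import random

ALPHABET = "ACGT"  # assumed DNA alphabet


def make_pwm(width, rng):
    """Random PWM: one probability distribution over ACGT per column."""
    pwm = []
    for _ in range(width):
        raw = [rng.random() + 0.5 for _ in ALPHABET]
        total = sum(raw)
        pwm.append([v / total for v in raw])
    return pwm


def log_score(pwm, kmer):
    """Log-likelihood of a k-mer under a PWM (columns are independent)."""
    return sum(math.log(col[ALPHABET.index(ch)]) for col, ch in zip(pwm, kmer))


def consensus(pwm):
    """Most probable base in each column."""
    return "".join(ALPHABET[col.index(max(col))] for col in pwm)


def train_som(kmers, rows, cols, epochs=20, seed=0, pseudocount=0.1):
    """Batch-style SOM training with PWM nodes (illustrative assumption)."""
    rng = random.Random(seed)
    width = len(kmers[0])
    nodes = {(r, c): make_pwm(width, rng)
             for r in range(rows) for c in range(cols)}
    for epoch in range(epochs):
        # Neighbourhood radius shrinks linearly, with a floor of 0.5.
        sigma = max(0.5, (max(rows, cols) / 2) * (1 - epoch / epochs))
        # Best-matching node for every k-mer under the current PWMs.
        bmu = [max(nodes, key=lambda n: log_score(nodes[n], k)) for k in kmers]
        # Re-estimate each node's PWM from neighbourhood-weighted counts.
        for node in nodes:
            counts = [[pseudocount] * len(ALPHABET) for _ in range(width)]
            for k, b in zip(kmers, bmu):
                d2 = (node[0] - b[0]) ** 2 + (node[1] - b[1]) ** 2
                h = math.exp(-d2 / (2 * sigma ** 2))
                for i, ch in enumerate(k):
                    counts[i][ALPHABET.index(ch)] += h
            nodes[node] = [[v / sum(col) for v in col] for col in counts]
    return nodes


if __name__ == "__main__":
    toy_kmers = ["ACGT"] * 5  # toy input, not data from the paper
    som = train_som(toy_kmers, rows=2, cols=2)
    for position, pwm in sorted(som.items()):
        print(position, consensus(pwm))
```

After training, nodes that attract many similar k-mers hold PWMs summarising a recurring sequence feature, which is the sense in which such a map could help flag candidate overrepresented motifs. A batch update is used here only because it keeps the sketch short; a sequential, per-sample update would be an equally plausible reading of the abstract.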

Author information

Corresponding author

Correspondence to Shaun Mahony.

About this article

Cite this article

Mahony, S., Hendrix, D., Smith, T.J. et al. Self-Organizing Maps of Position Weight Matrices for Motif Discovery in Biological Sequences. Artif Intell Rev 24, 397–413 (2005). https://doi.org/10.1007/s10462-005-9011-9

  • DOI: https://doi.org/10.1007/s10462-005-9011-9
