Biological Sequence Data Preprocessing for Classification: A Case Study in Splice Site Identification

  • Conference paper
Advances in Neural Networks – ISNN 2007 (ISNN 2007)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 4492)

Abstract

The rapid growth of biological sequence data demands better and more efficient analysis methods. Effective detection of regulatory signals in these sequences requires knowledge of the characteristics, dependencies, and relationships of nucleotides in the region surrounding those signals. A higher order Markov model is generally regarded as a useful technique for modeling higher order dependencies among nucleotides. However, implementing one requires estimating a large number of parameters, which is computationally expensive. In this paper, we propose a hybrid method consisting of a first order Markov model for sequence data preprocessing and a multilayer perceptron neural network for classification. The Markov model captures the compositional features and dependencies of nucleotides as probabilistic parameters, which are used as inputs to the classifier. The classifier combines the Markov probabilities nonlinearly for signal detection. When applied to the splice site detection problem on three widely used data sets, the proposed hybrid method is able to model higher order dependencies and achieves better classification accuracies.
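
As a rough illustration of the preprocessing idea described in the abstract, the sketch below estimates first order Markov probabilities from aligned training sequences and then encodes a sequence as a vector of those probabilities, which could serve as classifier inputs. The position-specific formulation, the pseudocount smoothing, the function names fit_first_order_markov and encode, and the toy sequences are all assumptions made for illustration; the abstract does not specify these details, so this should not be read as the authors' implementation.

```python
ALPHABET = "ACGT"


def fit_first_order_markov(sequences, pseudocount=1.0):
    """Estimate position-specific first order Markov parameters.

    Assumes aligned, equal-length strings over ACGT and pseudocount > 0.
    Returns (start, trans) where start[b] approximates P(x_0 = b) and
    trans[i][a][b] approximates P(x_i = b | x_{i-1} = a) for i >= 1.
    """
    length = len(sequences[0])
    start = {b: pseudocount for b in ALPHABET}
    # trans[0] is an unused placeholder so that list indices match positions.
    trans = [
        {a: {b: pseudocount for b in ALPHABET} for a in ALPHABET}
        for _ in range(length)
    ]
    for seq in sequences:
        start[seq[0]] += 1
        for i in range(1, length):
            trans[i][seq[i - 1]][seq[i]] += 1

    # Normalise the smoothed counts into probabilities.
    total = sum(start.values())
    start = {b: c / total for b, c in start.items()}
    for table in trans:
        for a in ALPHABET:
            row_total = sum(table[a].values())
            table[a] = {b: c / row_total for b, c in table[a].items()}
    return start, trans


def encode(seq, model):
    """Map a sequence to one Markov probability per position."""
    start, trans = model
    features = [start[seq[0]]]
    for i in range(1, len(seq)):
        features.append(trans[i][seq[i - 1]][seq[i]])
    return features


# Hypothetical toy data, far shorter than any real splice site window.
toy_sites = ["ACGT", "ACGA", "ACGT"]
model = fit_first_order_markov(toy_sites)
features = encode("ACGT", model)
print(features)
```

One plausible arrangement, again an assumption rather than something stated in the abstract, is to fit one such model on true sites and another on false sites, concatenate the two feature vectors, and pass the result to a multilayer perceptron that combines the probabilities nonlinearly.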

Editor information

Derong Liu, Shumin Fei, Zengguang Hou, Huaguang Zhang, Changyin Sun

Copyright information

© 2007 Springer Berlin Heidelberg

About this paper

Cite this paper

Baten, A.K.M.A., Halgamuge, S.K., Chang, B., Wickramarachchi, N. (2007). Biological Sequence Data Preprocessing for Classification: A Case Study in Splice Site Identification. In: Liu, D., Fei, S., Hou, Z., Zhang, H., Sun, C. (eds) Advances in Neural Networks – ISNN 2007. ISNN 2007. Lecture Notes in Computer Science, vol 4492. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-72393-6_144

  • DOI: https://doi.org/10.1007/978-3-540-72393-6_144

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-72392-9

  • Online ISBN: 978-3-540-72393-6

  • eBook Packages: Computer Science, Computer Science (R0)
