Abstract
The rapid growth of biological sequence data demands better and more efficient analysis methods. Detecting regulatory signals in these sequences effectively requires knowledge of the characteristics, dependencies, and relationships of the nucleotides in the region surrounding each signal. A higher-order Markov model is generally regarded as a useful technique for modeling higher-order dependencies among nucleotides. However, implementing it requires estimating a large number of parameters, which is computationally expensive. In this paper, we propose a hybrid method consisting of a first-order Markov model for sequence data preprocessing and a multilayer perceptron neural network for classification. The Markov model captures the compositional features and dependencies of nucleotides as probabilistic parameters, which are used as inputs to the classifier. The classifier then combines the Markov probabilities nonlinearly for signal detection. When applied to the splice site detection problem on three widely used data sets, the proposed hybrid method is observed to model higher-order dependencies and to achieve better classification accuracy.
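The preprocessing idea can be illustrated with a minimal sketch. The abstract does not specify the exact parameterization, so the code below assumes position-specific first-order transition probabilities estimated from aligned, equal-length training sequences over the alphabet A, C, G, T, smoothed with a pseudocount. The function names `fit_first_order_markov` and `encode` and the `pseudocount` parameter are hypothetical and not taken from the paper. The resulting feature vector is the kind of input a multilayer perceptron could then combine nonlinearly; the classifier itself is omitted here.

```python
ALPHABET = "ACGT"  # assumption: sequences contain only these four symbols


def fit_first_order_markov(sequences, pseudocount=1.0):
    """Estimate position-specific first-order Markov parameters.

    Returns (initial, transitions), where initial[b] approximates
    P(s_0 = b) and transitions[i - 1][(a, b)] approximates
    P(s_i = b | s_{i-1} = a) for positions i = 1 .. L-1.
    """
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("all sequences must be aligned to the same length")

    # Smoothed distribution of the first nucleotide.
    first_counts = {b: pseudocount for b in ALPHABET}
    for s in sequences:
        first_counts[s[0]] += 1
    total = sum(first_counts.values())
    initial = {b: c / total for b, c in first_counts.items()}

    # One smoothed 4x4 transition table per adjacent pair of positions.
    transitions = []
    for i in range(1, length):
        counts = {(a, b): pseudocount for a in ALPHABET for b in ALPHABET}
        for s in sequences:
            counts[(s[i - 1], s[i])] += 1
        probs = {}
        for a in ALPHABET:
            row_total = sum(counts[(a, b)] for b in ALPHABET)
            for b in ALPHABET:
                probs[(a, b)] = counts[(a, b)] / row_total
        transitions.append(probs)
    return initial, transitions


def encode(sequence, model):
    """Map a sequence to one Markov probability per position."""
    initial, transitions = model
    features = [initial[sequence[0]]]
    for i in range(1, len(sequence)):
        features.append(transitions[i - 1][(sequence[i - 1], sequence[i])])
    return features


# Toy usage: fit on a few aligned windows, then encode one of them.
model = fit_first_order_markov(["AGGT", "AGGT", "CGGT"])
features = encode("AGGT", model)
```

In a two-class setting, one plausible design is to fit one such model on true sites and another on false sites and concatenate both encodings, but that choice is an assumption rather than something the abstract states.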
© 2007 Springer Berlin Heidelberg
About this paper
Cite this paper
Baten, A.K.M.A., Halgamuge, S.K., Chang, B., Wickramarachchi, N. (2007). Biological Sequence Data Preprocessing for Classification: A Case Study in Splice Site Identification. In: Liu, D., Fei, S., Hou, Z., Zhang, H., Sun, C. (eds) Advances in Neural Networks – ISNN 2007. ISNN 2007. Lecture Notes in Computer Science, vol 4492. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-72393-6_144
DOI: https://doi.org/10.1007/978-3-540-72393-6_144
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-72392-9
Online ISBN: 978-3-540-72393-6
eBook Packages: Computer Science, Computer Science (R0)