Abstract
Basically, one of the most important issues in identifying biological sequences is accuracy; however, since the exponential growth and excessive diversity of biological data, the requirement to compute within a considerably appropriate time span is usually in conflict with accuracy. We propose a novel approach for accurate identification of DNA sequences in shorter time by discovering sequence patterns – signatures, which are sufficiently distinctive information for the identity of a sequence. The approach is to discover the signatures from the best combination of n-gram patterns and statistics-based models, which are regularly used in the research of Information Retrieval, and then use the signatures to create identifiers. We evaluate the performance of all identifiers on three different types of organisms and three different numbers of identification classes. The experimental results showed that the difference of organisms has no effect on the performance of the proposed model; whereas the different numbers of classes slightly affect the performance. The sole use of Information Gain is changed in a small range of n-grams since the use of its pattern absence brings the unbalanced class and pattern score distribution. However, several identifiers provide over 95% and up to 100% of accuracy, when they are constructed by signatures using the appropriate n-grams and statistics-based models. Our proposed model works well in identifying DNA sequences accurately, and it requires less processing time.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
References
Aalbersberg, I.: A Document Retrieval Model Based on Term Frequency Ranks. In: Proceedings of the 7th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 163–172 (1994)
Benson, D.A., Karsch-Mizrachi, I., Lipman, D.J., Ostell, J., Rapp, B.A., Wheeler, D.L.: GenBank. Nucleic Acids Research 28(1), 15–18 (2000)
Brown, P.F., de Souza, P.V., Della Pietra, V.J., Mercer, R.L.: Class-Based N-Gram Models of Natural Language. Computational Linguistics 18(4), 467–479 (1992)
Chuzhanova, N.A., Jones, A.J., Margetts, S.: Feature Selection for Genetic Sequence Classification. Bioinformatics Journal 14(2), 139–143 (1998)
Joachims, T.: A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization. In: Proceedings of the 14th International Conference on Machine Learning, pp. 143–151 (1997)
Keinduangjun, J., Piamsa-nga, P., Poovorawan, Y.: Models for Discovering Signatures in DNA Sequences. In: Proceedings of the 3rd IASTED International Conference on Biomedical Engineering, Innsbruck, Austria, pp. 548–553 (2005)
Krauthammer, M., Rzhetsky, A., Morozov, P., Friedman, C.: Using BLAST for Identifying Gene and Protein Names in Journal Articles. Gene 259(1-2), 245–252 (2000)
Mladenic, D., Grobelnik, M.: Feature Selection for Classification Based on Text Hierarchy. In: Working Notes of Learning from Text and the Web. Conference on Automated Learning and Discovery, Carnegie Mellon University, Pittsburgh (1998)
Mladenic, D., Grobelnik, M.: Feature Selection for Unbalanced Class Distribution and Naïve Bayes. In: Proceedings of the 16th International Conference on Machine Learning, pp. 258–267 (1999)
Pearson, W.R.: Using the FASTA Program to Search Protein and DNA Sequence Databases. Methods Molecular Biology 25, 365–389 (1994)
Sebastiani, F.: Machine Learning in Automated Text Categorization. ACM Computing Surveys 34(1), 1–47 (2002)
Spitters, M.: Comparing Feature Sets for Learning Text Categorization. In: Proceedings on RIAO (2000)
Wang, J.T.L., Rozen, S., Shapiro, B.A., Shasha, D., Wang, Z., Yin, M.: New Techniques for DNA Sequence Classification. Journal of Computational Biology 6(2), 209–218 (1999)
Xu, Y., Mural, R., Einstein, J., Shah, M., Uberbacher, E.: Grail: A Multiagent Neural Network System for Gene Identification. Proceedings of the IEEE 84(10), 1544–1552 (1996)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Keinduangjun, J., Piamsa-nga, P., Poovorawan, Y. (2005). DNA Sequence Identification by Statistics-Based Models. In: Wang, L., Jin, Y. (eds) Fuzzy Systems and Knowledge Discovery. FSKD 2005. Lecture Notes in Computer Science(), vol 3614. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11540007_134
Download citation
DOI: https://doi.org/10.1007/11540007_134
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-28331-7
Online ISBN: 978-3-540-31828-6
eBook Packages: Computer ScienceComputer Science (R0)