DNA Sequence Identification by Statistics-Based Models

Keinduangjun, Jitimon; Piamsa-nga, Punpiti; Poovorawan, Yong

doi:10.1007/11540007_134

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3614)

Included in the following conference series:

International Conference on Fuzzy Systems and Knowledge Discovery

Abstract

Accuracy is one of the most important issues in identifying biological sequences. However, because biological data are growing exponentially and are highly diverse, the need to compute within a reasonable time span usually conflicts with accuracy. We propose a novel approach that identifies DNA sequences accurately in less time by discovering sequence patterns, called signatures, that are distinctive enough to establish the identity of a sequence. The approach discovers signatures from the best combination of n-gram patterns and statistics-based models, both of which are regularly used in Information Retrieval research, and then uses the signatures to create identifiers. We evaluate the performance of all identifiers on three different types of organisms and three different numbers of identification classes. The experimental results show that the type of organism has no effect on the performance of the proposed model, whereas the number of classes affects it slightly. The performance of Information Gain used alone changes over a small range of n-grams, because its use of pattern absence produces an unbalanced class and pattern score distribution. Nevertheless, several identifiers achieve accuracy above 95% and up to 100% when they are constructed from signatures using appropriate n-grams and statistics-based models. The proposed model identifies DNA sequences accurately and requires less processing time.
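The abstract does not give the scoring formulas, so the following is only a minimal sketch of the general idea: extract overlapping n-grams from labelled DNA sequences and rank them by Information Gain over pattern presence and absence, as is common in text categorization. The function names (`ngrams`, `information_gain`, `top_signatures`) and the toy data are assumptions made for illustration, not the authors' implementation.

```python
import math
from collections import Counter


def ngrams(seq, n):
    """Return the set of overlapping n-grams that occur in a sequence."""
    return {seq[i:i + n] for i in range(len(seq) - n + 1)}


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels).values()
    return -sum((c / total) * math.log2(c / total) for c in counts)


def information_gain(pattern, sequences, labels):
    """Information Gain of a pattern's presence/absence w.r.t. the classes.

    Assumption: a pattern is "present" if it occurs anywhere in the sequence.
    """
    present = [lab for s, lab in zip(sequences, labels) if pattern in s]
    absent = [lab for s, lab in zip(sequences, labels) if pattern not in s]
    conditional = 0.0
    for part in (present, absent):
        if part:  # skip an empty split to avoid dividing by zero
            conditional += len(part) / len(labels) * entropy(part)
    return entropy(labels) - conditional


def top_signatures(sequences, labels, n, k):
    """Return the k highest-scoring n-grams as (pattern, score) pairs."""
    candidates = set()
    for s in sequences:
        candidates |= ngrams(s, n)
    scored = [(p, information_gain(p, sequences, labels)) for p in candidates]
    # Sort by descending score, then alphabetically so ties are deterministic.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:k]


# Hypothetical toy data: two sequences per class.
seqs = ["ACGTAC", "ACGTTT", "GGGCCC", "GGGCAA"]
labs = ["a", "a", "b", "b"]
signatures = top_signatures(seqs, labs, n=3, k=2)
```

In this toy setting the highest-scoring 3-grams are those that occur in every sequence of one class and in none of the other. A real identifier would also need a rule for combining signature scores into a class decision, which the abstract does not specify.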

Author information

Authors and Affiliations

Department of Computer Engineering, Faculty of Engineering, Kasetsart University, Bangkok, 10900, Thailand
Jitimon Keinduangjun & Punpiti Piamsa-nga
Department of Pediatrics, Faculty of Medicine, Chulalongkorn University, Bangkok, 10400, Thailand
Yong Poovorawan

Editor information

Editors and Affiliations

School of Electrical and Electronic Engineering, Nanyang Technological University, Block S1, Nanyang Avenue, 639798, Singapore
Lipo Wang
Honda Research Institute Europe GmbH, Offenbach/Main, Germany
Yaochu Jin

About this paper

Cite this paper

Keinduangjun, J., Piamsa-nga, P., Poovorawan, Y. (2005). DNA Sequence Identification by Statistics-Based Models. In: Wang, L., Jin, Y. (eds) Fuzzy Systems and Knowledge Discovery. FSKD 2005. Lecture Notes in Computer Science, vol 3614. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11540007_134

DOI: https://doi.org/10.1007/11540007_134
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-28331-7
Online ISBN: 978-3-540-31828-6
eBook Packages: Computer Science, Computer Science (R0)
