Signature Recognition Methods for Identifying Influenza Sequences

Keinduangjun, Jitimon; Piamsa-nga, Punpiti; Poovorawan, Yong

doi:10.1007/11527770_67

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3581)

Included in the following conference series:

Conference on Artificial Intelligence in Medicine in Europe

Abstract

Accuracy is one of the most important issues in identifying biological sequences; however, because biological data are growing exponentially and are highly diverse, the need to compute within a reasonable time usually compromises accuracy. We propose novel approaches for identifying DNA sequences accurately and in less time by discovering sequence patterns, called signatures, that carry enough distinctive information for sequence identification. The approaches search for the best combination of n-gram patterns and six statistical scoring algorithms that are regularly used in Information Retrieval research, and then use the signatures to build a similarity scoring model for identifying the DNA. We develop two approaches to discover the signatures. The first uses only statistical information extracted directly from the sequences. The second uses prior knowledge of the DNA in the signature discovery process. From our experiments on influenza virus, we found that: 1) our technique can identify the influenza virus with an accuracy of up to 99.69% when 11-grams are used and prior knowledge is applied; 2) signatures that are too short or too long yield lower efficiency; and 3) most scoring algorithms perform well for identification except the “Rocchio algorithm”, whose results are approximately 9% lower than the others. Moreover, this technique can be applied to identifying other organisms.
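
The general idea of n-gram signatures and a similarity score can be illustrated with a minimal sketch. It is not the authors' method: the abstract does not name five of the six scoring algorithms or define the similarity model, so the scoring rule below (difference in document frequency between the target class and the other sequences) and the similarity measure (fraction of signatures found in the query) are assumptions chosen for illustration. The function names `ngrams`, `discover_signatures`, and `similarity` are hypothetical, and the sequences are made-up toy strings, not real influenza data.

```python
from collections import Counter


def ngrams(seq, n):
    """Return the set of distinct n-grams (substrings of length n) in seq."""
    seq = seq.upper()
    return {seq[i:i + n] for i in range(len(seq) - n + 1)}


def discover_signatures(target_seqs, other_seqs, n, top_k):
    """Select the top_k n-grams most specific to the target class.

    ASSUMPTION: the score is the document frequency in the target class
    minus the document frequency in the other sequences. This is a
    stand-in rule, not one of the paper's six scoring algorithms.
    """
    target_df = Counter(g for s in target_seqs for g in ngrams(s, n))
    other_df = Counter(g for s in other_seqs for g in ngrams(s, n))
    n_target = max(len(target_seqs), 1)
    n_other = max(len(other_seqs), 1)
    scores = {
        g: target_df[g] / n_target - other_df[g] / n_other
        for g in target_df
    }
    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda g: (-scores[g], g))
    return ranked[:top_k]


def similarity(query, signatures, n):
    """ASSUMPTION: similarity is the fraction of signatures found in query."""
    if not signatures:
        return 0.0
    grams = ngrams(query, n)
    return sum(1 for sig in signatures if sig in grams) / len(signatures)


# Toy, made-up sequences (not real viral data).
target = ["ACGTACGTAA", "TTACGTACGG"]
others = ["GGGGCCCCGG", "CCGGGGCCCC"]

signatures = discover_signatures(target, others, n=4, top_k=4)
score_match = similarity("AAACGTACGTTT", signatures, n=4)
score_other = similarity("GGGGCCCC", signatures, n=4)
print(signatures, score_match, score_other)
```

In this sketch a query would be assigned to the target class when its score exceeds some threshold; how that threshold is chosen, and how prior knowledge of the DNA would enter the signature discovery step, are left open here because the abstract does not specify them.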

Author information

Authors and Affiliations

Department of Computer Engineering, Faculty of Engineering, Kasetsart University, Bangkok, 10900, Thailand
Jitimon Keinduangjun & Punpiti Piamsa-nga
Department of Pediatrics, Faculty of Medicine, Chulalongkorn University, Bangkok, 10400, Thailand
Yong Poovorawan

Editor information

Editors and Affiliations

Department of Information and Knowledge Engineering, Danube University Krems, Dr.-Karl-Dorrek-Str. 30, 3500, Krems, Austria
Silvia Miksch
Department of Computing Science, University of Aberdeen, AB24 3UE, Aberdeen, UK
Jim Hunter
Department of Computer Science, University of Cyprus, P.O.Box 20537, CY-1678, Nicosia, Cyprus
Elpida T. Keravnou

About this paper

Cite this paper

Keinduangjun, J., Piamsa-nga, P., Poovorawan, Y. (2005). Signature Recognition Methods for Identifying Influenza Sequences. In: Miksch, S., Hunter, J., Keravnou, E.T. (eds) Artificial Intelligence in Medicine. AIME 2005. Lecture Notes in Computer Science, vol 3581. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11527770_67

DOI: https://doi.org/10.1007/11527770_67
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-27831-3
Online ISBN: 978-3-540-31884-2
eBook Packages: Computer Science, Computer Science (R0)
