Automatic Extraction of Highly Predictive Sequence Features that Incorporate Contiguity and Mutation

Wan, Hao; Ruiz, Carolina; Beck, Joseph

doi:10.1007/978-3-319-26129-4_14

Part of the book series: Communications in Computer and Information Science (CCIS, volume 511)

Included in the following conference series:

International Joint Conference on Biomedical Engineering Systems and Technologies

Abstract

This paper investigates the problem of extracting sequence features that are useful for constructing prediction models. The method introduced here generates candidate features from contiguous subsequences and their mutations, and selects those candidates that are strongly associated with the classification target according to the Gini index. Experimental results on three genetic data sets provide evidence that this method outperforms other sequence feature generation methods from the literature, especially in domains where the presence of a feature within a sequence, rather than its specific location, is pertinent for classification.
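
The general idea in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm, whose details are not given here. It rests on several assumptions: candidate features are all contiguous length-k subsequences, a "mutation" is modelled as up to `max_mismatches` substitutions (Hamming distance), a feature is the presence of a motif anywhere in the sequence, and the Gini index is taken to be the weighted Gini impurity of the resulting two-way split. All function and parameter names are hypothetical.

```python
from collections import Counter


def kmers(seq, k):
    """Return the set of contiguous length-k subsequences of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def contains(seq, motif, max_mismatches):
    """True if some window of seq matches motif with at most
    max_mismatches substitutions (assumed model of 'mutation')."""
    k = len(motif)
    return any(hamming(seq[i:i + k], motif) <= max_mismatches
               for i in range(len(seq) - k + 1))


def gini_impurity(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_c ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_split(sequences, labels, motif, max_mismatches=1):
    """Weighted Gini impurity after splitting the data on presence of
    the motif anywhere in the sequence. Lower means a stronger
    association with the class label."""
    present, absent = [], []
    for s, y in zip(sequences, labels):
        (present if contains(s, motif, max_mismatches) else absent).append(y)
    n = len(labels)
    return (len(present) / n) * gini_impurity(present) \
        + (len(absent) / n) * gini_impurity(absent)


def top_features(sequences, labels, k=4, max_mismatches=1, top_n=5):
    """Score every k-mer seen in the data and return the top_n
    (score, motif) pairs with the lowest weighted impurity."""
    candidates = set().union(*(kmers(s, k) for s in sequences))
    scored = sorted((gini_split(sequences, labels, m, max_mismatches), m)
                    for m in candidates)
    return scored[:top_n]


# Toy data, invented for illustration only.
toy_sequences = ["ACGTAC", "TTACGT", "GGGGGG", "TTTTTT"]
toy_labels = [1, 1, 0, 0]
best = top_features(toy_sequences, toy_labels, k=4, max_mismatches=1)
```

Because `contains` scans every window, the feature captures presence rather than position, which matches the setting the abstract singles out. The exhaustive scan is quadratic in practice and is meant only for small examples.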

Author information

Authors and Affiliations

Department of Computer Science, Worcester Polytechnic Institute, Worcester, MA, USA
Hao Wan, Carolina Ruiz & Joseph Beck

Corresponding author

Correspondence to Carolina Ruiz.

Editor information

Editors and Affiliations

ESEO, Angers Cedex 02, France
Guy Plantier
Cognitive Systems Lab., Karlsruhe Institute of Technology, Karlsruhe, Baden-Württemberg, Germany
Tanja Schultz
Technical University of Lisbon, Lisbon, Portugal
Ana Fred
New University of Lisbon, Lisboa, Portugal
Hugo Gamboa

About this paper

Cite this paper

Wan, H., Ruiz, C., Beck, J. (2015). Automatic Extraction of Highly Predictive Sequence Features that Incorporate Contiguity and Mutation. In: Plantier, G., Schultz, T., Fred, A., Gamboa, H. (eds) Biomedical Engineering Systems and Technologies. BIOSTEC 2014. Communications in Computer and Information Science, vol 511. Springer, Cham. https://doi.org/10.1007/978-3-319-26129-4_14

DOI: https://doi.org/10.1007/978-3-319-26129-4_14
Published: 07 January 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-26128-7
Online ISBN: 978-3-319-26129-4
eBook Packages: Computer Science, Computer Science (R0)
