Abstract
This paper investigates the problem of extracting sequence features that can be useful in the construction of prediction models. The method introduced in this paper generates such features by considering contiguous subsequences and their mutations, and by selecting those candidate features that have a strong association with the classification target according to the Gini index. Experimental results on three genetic data sets provide evidence of the superiority of this method over other sequence feature generation methods from the literature, especially in domains where presence, not specific location, of features within a sequence is pertinent for classification.
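To make the idea in the abstract concrete, the following is a minimal sketch, not the authors' algorithm. It assumes that candidate features are fixed-length contiguous subsequences (k-mers), that a "mutation" can be approximated by allowing a bounded number of substitutions when matching, and that association with the class is scored as the reduction in Gini impurity obtained by splitting on presence or absence of the feature. All function names and parameters (`kmers`, `matches`, `gini_gain`, `top_features`, `k`, `max_mismatches`, `top_n`) are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum of squared class shares."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def kmers(seq, k):
    """Set of all distinct contiguous subsequences of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def matches(seq, motif, max_mismatches=0):
    """True if motif occurs anywhere in seq with at most max_mismatches
    substitutions (an assumed stand-in for the paper's notion of mutation)."""
    k = len(motif)
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if sum(a != b for a, b in zip(window, motif)) <= max_mismatches:
            return True
    return False


def gini_gain(sequences, labels, motif, max_mismatches=0):
    """Reduction in Gini impurity when splitting on presence of motif.
    Position within the sequence is ignored; only presence counts."""
    present, absent = [], []
    for seq, label in zip(sequences, labels):
        (present if matches(seq, motif, max_mismatches) else absent).append(label)
    n = len(labels)
    weighted = (len(present) / n) * gini(present) + (len(absent) / n) * gini(absent)
    return gini(labels) - weighted


def top_features(sequences, labels, k=3, max_mismatches=0, top_n=5):
    """Score every k-mer seen in the data and return the top_n (gain, motif) pairs."""
    candidates = set().union(*(kmers(s, k) for s in sequences))
    scored = [(gini_gain(sequences, labels, m, max_mismatches), m)
              for m in sorted(candidates)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:top_n]


if __name__ == "__main__":
    # Toy data invented for illustration only.
    seqs = ["AAACGT", "TTACGA", "GGGTTT", "CCCTTA"]
    ys = [1, 1, 0, 0]
    for gain, motif in top_features(seqs, ys, k=3, max_mismatches=0):
        print(motif, round(gain, 3))
```

Scoring by presence rather than by position mirrors the abstract's remark that the approach is most useful where the location of a feature within the sequence does not matter. The exhaustive enumeration here is only practical for short sequences and small `k`.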
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Wan, H., Ruiz, C., Beck, J. (2015). Automatic Extraction of Highly Predictive Sequence Features that Incorporate Contiguity and Mutation. In: Plantier, G., Schultz, T., Fred, A., Gamboa, H. (eds) Biomedical Engineering Systems and Technologies. BIOSTEC 2014. Communications in Computer and Information Science, vol 511. Springer, Cham. https://doi.org/10.1007/978-3-319-26129-4_14
DOI: https://doi.org/10.1007/978-3-319-26129-4_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-26128-7
Online ISBN: 978-3-319-26129-4
eBook Packages: Computer Science, Computer Science (R0)