Automatic Extraction of Highly Predictive Sequence Features that Incorporate Contiguity and Mutation

  • Conference paper
Biomedical Engineering Systems and Technologies (BIOSTEC 2014)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 511)

Abstract

This paper investigates the problem of extracting sequence features that can be useful in the construction of prediction models. The method introduced in this paper generates such features by considering contiguous subsequences and their mutations, and by selecting those candidate features that have a strong association with the classification target according to the Gini index. Experimental results on three genetic data sets provide evidence of the superiority of this method over other sequence feature generation methods from the literature, especially in domains where presence, not specific location, of features within a sequence is pertinent for classification.
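To make the general idea concrete, the following is a minimal Python sketch of presence-based sequence features scored by Gini impurity reduction. It is not the authors' algorithm: the abstract does not give the details, so the sketch assumes that candidate features are fixed-length contiguous subsequences (k-mers), that a "mutation" is at most one substitution, and that association with the class is measured as the drop in Gini impurity when the data are split on feature presence. All function names are hypothetical.

```python
from collections import Counter


def kmers(seq, k):
    """Return the set of contiguous length-k subsequences present in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def within_one_mismatch(a, b):
    """True if equal-length strings a and b differ in at most one position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) <= 1


def has_feature(seq, motif):
    """Presence test (assumption): some k-mer of seq matches motif
    up to one substitution, regardless of where it occurs."""
    return any(within_one_mismatch(w, motif) for w in kmers(seq, len(motif)))


def gini(labels):
    """Gini impurity 1 - sum_c p_c^2 of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_gain(seqs, labels, motif):
    """Impurity reduction obtained by splitting on presence of motif."""
    present = [y for s, y in zip(seqs, labels) if has_feature(s, motif)]
    absent = [y for s, y in zip(seqs, labels) if not has_feature(s, motif)]
    n = len(labels)
    weighted = (len(present) / n) * gini(present) + (len(absent) / n) * gini(absent)
    return gini(labels) - weighted


def rank_features(seqs, labels, k, top=5):
    """Score every k-mer seen in the data and return the top candidates
    as (gain, motif) pairs, highest gain first."""
    candidates = set().union(*(kmers(s, k) for s in seqs))
    scored = sorted(
        ((gini_gain(seqs, labels, m), m) for m in candidates),
        key=lambda t: (-t[0], t[1]),
    )
    return scored[:top]
```

Calling `rank_features(seqs, labels, k=4)` on a small list of labeled DNA strings returns the k-mers whose (approximate) presence best separates the classes. Because `has_feature` ignores position, the sketch reflects the setting the abstract highlights, where presence rather than location of a feature matters.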

Author information

Corresponding author

Correspondence to Carolina Ruiz.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Wan, H., Ruiz, C., Beck, J. (2015). Automatic Extraction of Highly Predictive Sequence Features that Incorporate Contiguity and Mutation. In: Plantier, G., Schultz, T., Fred, A., Gamboa, H. (eds) Biomedical Engineering Systems and Technologies. BIOSTEC 2014. Communications in Computer and Information Science, vol 511. Springer, Cham. https://doi.org/10.1007/978-3-319-26129-4_14

  • DOI: https://doi.org/10.1007/978-3-319-26129-4_14

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-26128-7

  • Online ISBN: 978-3-319-26129-4

  • eBook Packages: Computer Science, Computer Science (R0)
