A New Feature Selection Methodology for K-mers Representation of DNA Sequences

  • Conference paper
Computational Intelligence Methods for Bioinformatics and Biostatistics (CIBB 2014)

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 8623)

Abstract

Decomposing a DNA sequence into k-mers and counting their frequencies defines a mapping from the sequence into a numerical space, representing it as a numerical feature vector of fixed length. This simple process allows sequences to be compared in an alignment-free way, using common similarity and distance functions on the numerical codomain of the mapping. The most commonly used decomposition takes all the substrings of a fixed length k, so the dimension of the codomain grows exponentially with k. This can affect the time complexity of the similarity computation and, more generally, of the machine learning algorithm used for sequence analysis. Moreover, the presence of noisy features can also reduce classification accuracy. In this paper we propose a feature selection method that selects the most informative k-mers associated with a set of DNA sequences. The selection is based on the Motif Independent Measure (MIM), an unbiased quantitative measure of DNA sequence specificity that we recently introduced in the literature. Results computed on public datasets show the effectiveness of the proposed feature selection method.
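The k-mer mapping described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and does not include the MIM-based selection, which the abstract does not specify. The function names `kmer_index`, `kmer_vector`, and `euclidean` are hypothetical, as are the choices of relative frequencies (rather than raw counts) and of the Euclidean distance as the comparison function.

```python
from collections import Counter
from itertools import product
import math

ALPHABET = "ACGT"  # assumption: plain DNA alphabet, no ambiguity codes


def kmer_index(k):
    """Return all 4**k possible k-mers in lexicographic order."""
    return ["".join(p) for p in product(ALPHABET, repeat=k)]


def kmer_vector(seq, k):
    """Map a DNA sequence to a fixed-length vector of relative k-mer frequencies."""
    seq = seq.upper()
    # Slide a window of length k over the sequence and count each substring.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = kmer_index(k)
    # Windows containing symbols outside ACGT are ignored in the total.
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


def euclidean(u, v):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


if __name__ == "__main__":
    a = kmer_vector("ACGTACGTAC", 3)
    b = kmer_vector("TTGACCGTAA", 3)
    print(len(a))           # vector length is 4**k, whatever the sequence length
    print(euclidean(a, b))  # alignment-free distance between the two sequences
```

The vector length is 4^k regardless of sequence length, which is why the dimension becomes costly as k grows and why selecting a subset of informative k-mers is attractive.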

The original version of this chapter was revised: the given names and family names of the authors have been corrected. The erratum to this chapter is available at DOI: 10.1007/978-3-319-24462-4_26

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Lo Bosco, G., Pinello, L. (2015). A New Feature Selection Methodology for K-mers Representation of DNA Sequences. In: Di Serio, C., Liò, P., Nonis, A., Tagliaferri, R. (eds) Computational Intelligence Methods for Bioinformatics and Biostatistics. CIBB 2014. Lecture Notes in Computer Science, vol 8623. Springer, Cham. https://doi.org/10.1007/978-3-319-24462-4_9

  • DOI: https://doi.org/10.1007/978-3-319-24462-4_9

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-24461-7

  • Online ISBN: 978-3-319-24462-4

  • eBook Packages: Computer Science, Computer Science (R0)
