A New Feature Selection Methodology for K-mers Representation of DNA Sequences

Lo Bosco, Giosuè; Pinello, Luca

doi:10.1007/978-3-319-24462-4_9

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 8623)

Included in the following conference series:

International Meeting on Computational Intelligence Methods for Bioinformatics and Biostatistics

An erratum to this publication is available online at https://doi.org/10.1007/978-3-319-24462-4_26

Abstract

Decomposing a DNA sequence into k-mers and counting their frequencies defines a mapping of the sequence into a numerical space, representing it by a numerical feature vector of fixed length. This simple process allows sequences to be compared in an alignment-free way, using common similarity and distance functions on the numerical codomain of the mapping. The most commonly used decomposition takes all the substrings of a fixed length k, so the dimension of the codomain grows exponentially with k. This can increase the time complexity of the similarity computation and, more generally, of the machine learning algorithm used for sequence analysis. Moreover, noisy features can degrade classification accuracy. In this paper we propose a feature selection method able to select the most informative k-mers associated with a set of DNA sequences. The selection is based on the Motif Independent Measure (MIM), an unbiased quantitative measure for DNA sequence specificity that we have recently introduced in the literature. Results computed on public datasets show the effectiveness of the proposed feature selection method.
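
The k-mer mapping described in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`kmer_index`, `kmer_counts`, `euclidean`) are hypothetical, the alphabet is assumed to be A, C, G, T, windows containing any other symbol are skipped, and the Euclidean distance stands in for any common distance function. It does not implement the MIM-based feature selection, which the abstract does not specify.

```python
from itertools import product
from math import sqrt

ALPHABET = "ACGT"  # assumption: plain DNA alphabet, no ambiguity codes


def kmer_index(k):
    """Assign each of the 4**k possible k-mers a fixed vector position."""
    return {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=k))}


def kmer_counts(sequence, k, index=None):
    """Return the fixed-length vector of overlapping k-mer counts."""
    if index is None:
        index = kmer_index(k)
    counts = [0] * len(index)
    seq = sequence.upper()
    for start in range(len(seq) - k + 1):
        pos = index.get(seq[start:start + k])
        if pos is not None:  # skip windows with symbols outside ALPHABET
            counts[pos] += 1
    return counts


def euclidean(u, v):
    """Euclidean distance between two equal-length count vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


if __name__ == "__main__":
    a = kmer_counts("ACGTACGT", 3)
    b = kmer_counts("ACGTTTTT", 3)
    print(len(a), euclidean(a, b))
```

The vector length is 4**k regardless of sequence length (64 entries for k = 3, 4096 for k = 6), which is the exponential growth that motivates selecting a subset of k-mers.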

The original version of this chapter was revised: the given names and family names of the authors have been corrected. The erratum to this chapter is available at DOI: 10.1007/978-3-319-24462-4_26

Author information

Authors and Affiliations

Dipartimento di Matematica e Informatica, Università degli Studi di Palermo, Palermo, Italy
Giosuè Lo Bosco
Dipartimento di Scienze per l’Innovazione e le Tecnologie Abilitanti, Istituto Euro Mediterraneo di Scienza e Tecnologia, Palermo, Italy
Giosuè Lo Bosco
Department of Biostatistics, Harvard School of Public Health, Boston, MA, USA
Luca Pinello
Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Boston, MA, USA
Luca Pinello

Editor information

Editors and Affiliations

CUSSB, University "Vita-Salute" San Raffaele, Milano, Italy
Clelia DI Serio
The Computer Laboratory, University of Cambridge, Cambridge, United Kingdom
Pietro Liò
CUSSB, Università Vita-Salute San Raffaele, Milano, Italy
Alessandro Nonis
Dipartimento di Informatica, Università degli Studi di Salerno, Fisciano, Salerno, Italy
Roberto Tagliaferri

Cite this paper

Lo Bosco, G., Pinello, L. (2015). A New Feature Selection Methodology for K-mers Representation of DNA Sequences. In: DI Serio, C., Liò, P., Nonis, A., Tagliaferri, R. (eds) Computational Intelligence Methods for Bioinformatics and Biostatistics. CIBB 2014. Lecture Notes in Computer Science, vol 8623. Springer, Cham. https://doi.org/10.1007/978-3-319-24462-4_9

DOI: https://doi.org/10.1007/978-3-319-24462-4_9
Published: 18 November 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-24461-7
Online ISBN: 978-3-319-24462-4
eBook Packages: Computer Science, Computer Science (R0)
