Fast Kernel Methods for SVM Sequence Classifiers

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 4645)

Abstract

In this work we study string kernel methods for sequence analysis and focus on the problem of species-level identification based on short DNA fragments known as barcodes. We introduce efficient sorting-based algorithms for exact string k-mer kernels and then describe a divide-and-conquer technique for kernels with mismatches. Our algorithms for mismatch kernel matrix computations improve currently known time bounds for these computations. We then consider the mismatch kernel problem with feature selection, and present efficient algorithms for it. Our experimental results show that, for string kernels with mismatches, kernel matrices can be computed 100-200 times faster than traditional approaches. Kernel vector evaluations on new sequences show similar computational improvements. On several DNA barcode datasets, k-mer string kernels considerably improve identification accuracy compared to prior results. String kernels with feature selection demonstrate competitive performance with substantially fewer computations.
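
The idea of a sorting-based exact k-mer kernel can be illustrated with a minimal sketch. This is not the authors' algorithm, only one plausible reading of the abstract: every k-mer is tagged with the index of its source sequence, the tagged list is sorted so that identical k-mers become adjacent, and each run of equal k-mers contributes the product of its per-sequence counts to the kernel matrix. The function name `spectrum_kernel_matrix` and the list-of-lists matrix representation are assumptions made for illustration.

```python
from collections import Counter
from itertools import groupby


def spectrum_kernel_matrix(seqs, k):
    """Exact k-mer (spectrum) kernel matrix computed with one global sort.

    Illustrative sketch only; not the implementation from the paper.
    K[i][j] is the sum over all k-mers of count_i(kmer) * count_j(kmer).
    """
    n = len(seqs)
    # Tag every k-mer with the index of the sequence it came from.
    tagged = [
        (s[i:i + k], idx)
        for idx, s in enumerate(seqs)
        for i in range(len(s) - k + 1)
    ]
    # After sorting, identical k-mers sit next to each other.
    tagged.sort()

    K = [[0] * n for _ in range(n)]
    for _, run in groupby(tagged, key=lambda t: t[0]):
        # Count how often this k-mer occurs in each sequence.
        counts = Counter(idx for _, idx in run)
        # Each pair of sequences sharing the k-mer gets ci * cj.
        for i, ci in counts.items():
            for j, cj in counts.items():
                K[i][j] += ci * cj
    return K


if __name__ == "__main__":
    toy = ["ACGT", "ACGA", "TTTT"]
    for row in spectrum_kernel_matrix(toy, 2):
        print(row)
```

On the toy input above, the first two sequences share the 2-mers `AC` and `CG`, so their kernel value is 2, while the third sequence shares nothing with either. The sketch handles exact matches only; kernels with mismatches, which the abstract addresses with a divide-and-conquer technique, are not covered here.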

Editor information

Raffaele Giancarlo, Sridhar Hannenhalli

Copyright information

© 2007 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Kuksa, P., Pavlovic, V. (2007). Fast Kernel Methods for SVM Sequence Classifiers. In: Giancarlo, R., Hannenhalli, S. (eds) Algorithms in Bioinformatics. WABI 2007. Lecture Notes in Computer Science, vol 4645. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-74126-8_22

  • DOI: https://doi.org/10.1007/978-3-540-74126-8_22

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-74125-1

  • Online ISBN: 978-3-540-74126-8

  • eBook Packages: Computer Science, Computer Science (R0)
