Fast Target Set Reduction for Large-Scale Protein Function Prediction: A Multi-class Multi-label Machine Learning Approach

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 5251)

Abstract

Large-scale sequencing projects have produced a vast number of protein sequences, which have to be assigned to functional categories. Currently, profile hidden Markov models and kernel-based machine learning methods provide the most accurate results for protein classification. However, classifying new sequences with these approaches is computationally expensive. We present an approach for fast scoring of protein sequences by means of a feature-based protein sequence representation and multi-class multi-label machine learning techniques. Using the Pfam database, we show that our method provides high computational efficiency and that the approach is well suited for pre-filtering of large sequence sets.
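
The pre-filtering idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the dipeptide features, the centroid-based scoring vectors (a simple stand-in for a trained multi-label classifier), the toy sequences, the family names and the function names `kmer_features`, `train_centroids` and `reduce_targets` are all assumptions made for illustration. The sketch maps each sequence to a sparse feature vector, scores it against every family with one cheap dot product per family, and keeps only the top-ranked families as candidates for a slower, more accurate method.

```python
from collections import Counter, defaultdict
from math import sqrt


def kmer_features(seq, k=2):
    """L2-normalised k-mer counts as a sparse dict (a feature-based representation)."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    norm = sqrt(sum(c * c for c in counts.values())) or 1.0
    return {kmer: c / norm for kmer, c in counts.items()}


def train_centroids(labelled_seqs, k=2):
    """Build one linear scoring vector per family: the mean feature vector of its members.

    Multi-label: a sequence contributes to every family in its label set.
    Centroids are an illustrative stand-in, not a trained discriminative model.
    """
    sums = defaultdict(lambda: defaultdict(float))
    sizes = Counter()
    for seq, families in labelled_seqs:
        feats = kmer_features(seq, k)
        for fam in families:
            sizes[fam] += 1
            for kmer, value in feats.items():
                sums[fam][kmer] += value
    return {
        fam: {kmer: value / sizes[fam] for kmer, value in vec.items()}
        for fam, vec in sums.items()
    }


def reduce_targets(seq, weights, top_k=2, k=2):
    """Score a sequence against all families and return the top_k candidate families."""
    feats = kmer_features(seq, k)
    scores = {
        fam: sum(value * w.get(kmer, 0.0) for kmer, value in feats.items())
        for fam, w in weights.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Toy training data (hypothetical sequences and family names).
train = [
    ("AAAAKKKK", {"FAM_A"}),
    ("AAAKKKAA", {"FAM_A"}),
    ("GGGGSSGG", {"FAM_B"}),
    ("GGSSGGSS", {"FAM_B"}),
    ("AAKKGGSS", {"FAM_A", "FAM_B"}),  # a multi-label example
    ("WWWWYYWW", {"FAM_C"}),
]
weights = train_centroids(train)

# Only the returned families would be passed on to an expensive scorer.
candidates = reduce_targets("AAKKAAKK", weights, top_k=2)
```

In a realistic setting the scoring vectors would come from a trained classifier, and `top_k` would be chosen to trade the size of the reduced target set against the risk of discarding the true family.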

Editor information

Keith A. Crandall, Jens Lagergren

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Lingner, T., Meinicke, P. (2008). Fast Target Set Reduction for Large-Scale Protein Function Prediction: A Multi-class Multi-label Machine Learning Approach. In: Crandall, K.A., Lagergren, J. (eds) Algorithms in Bioinformatics. WABI 2008. Lecture Notes in Computer Science, vol 5251. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-87361-7_17

  • DOI: https://doi.org/10.1007/978-3-540-87361-7_17

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-87360-0

  • Online ISBN: 978-3-540-87361-7

  • eBook Packages: Computer Science, Computer Science (R0)
