Fast Target Set Reduction for Large-Scale Protein Function Prediction: A Multi-class Multi-label Machine Learning Approach

Lingner, Thomas; Meinicke, Peter

doi:10.1007/978-3-540-87361-7_17

Thomas Lingner¹ & Peter Meinicke¹

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 5251)

Abstract

Large-scale sequencing projects have produced a vast number of protein sequences, which have to be assigned to functional categories. Currently, profile hidden Markov models and kernel-based machine learning methods provide the most accurate results for protein classification. However, classifying new sequences with these approaches is computationally expensive. We present an approach for fast scoring of protein sequences by means of feature-based protein sequence representation and multi-class multi-label machine learning techniques. Using the Pfam database, we show that our method provides high computational efficiency and that the approach is well suited for pre-filtering of large sequence sets.
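
The abstract does not spell out the feature representation or the classifier, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes dipeptide (k = 2) frequency features and one linear scoring function per family; the family names, weights, biases, and threshold are hypothetical toy values. Each family is scored independently, so a sequence can keep several candidate families (multi-label), and only those candidates would then be passed to a more expensive method such as a profile hidden Markov model search.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
K = 2  # oligomer length; an illustrative choice, not taken from the paper
KMER_INDEX = {"".join(p): i for i, p in enumerate(product(AMINO_ACIDS, repeat=K))}


def kmer_features(sequence):
    """Map a protein sequence to a sparse dict {feature index: relative k-mer frequency}."""
    counts = {}
    for start in range(len(sequence) - K + 1):
        idx = KMER_INDEX.get(sequence[start:start + K])
        if idx is not None:  # skip k-mers containing non-standard residues such as 'X'
            counts[idx] = counts.get(idx, 0) + 1
    total = sum(counts.values())
    return {idx: c / total for idx, c in counts.items()} if total else {}


def score_families(features, weights, biases):
    """One linear score per family: w_f . x + b_f, summed over non-zero features only."""
    return {
        family: biases[family] + sum(w.get(idx, 0.0) * v for idx, v in features.items())
        for family, w in weights.items()
    }


def reduce_target_set(scores, threshold=0.0):
    """Multi-label decision: keep every family whose score exceeds the threshold."""
    return sorted(f for f, s in scores.items() if s > threshold)


# Toy model with hand-set weights (hypothetical; a real model would be
# trained on labelled family members, for example from Pfam).
weights = {
    "family_A": {KMER_INDEX["AA"]: 2.0},
    "family_B": {KMER_INDEX["GG"]: 2.0},
    "family_C": {KMER_INDEX["WW"]: 2.0},
}
biases = {"family_A": -0.5, "family_B": -0.5, "family_C": -0.5}

candidates = reduce_target_set(score_families(kmer_features("AAAAGGGG"), weights, biases))
print(candidates)  # ['family_A', 'family_B']
```

In this toy case the sequence keeps two of the three families, so a downstream search would need to evaluate only those two models. The cost per sequence is one pass over its k-mers plus one sparse dot product per family, which is what makes this kind of pre-filter cheap relative to alignment-based scoring.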

Author information

Authors and Affiliations

Department of Bioinformatics, Institute for Microbiology and Genetics, University of Göttingen, Goldschmidtstr. 1, 37077, Göttingen, Germany
Thomas Lingner & Peter Meinicke

Editor information

Keith A. Crandall, Jens Lagergren

Cite this paper

Lingner, T., Meinicke, P. (2008). Fast Target Set Reduction for Large-Scale Protein Function Prediction: A Multi-class Multi-label Machine Learning Approach. In: Crandall, K.A., Lagergren, J. (eds) Algorithms in Bioinformatics. WABI 2008. Lecture Notes in Computer Science, vol 5251. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-87361-7_17

DOI: https://doi.org/10.1007/978-3-540-87361-7_17
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-87360-0
Online ISBN: 978-3-540-87361-7
eBook Packages: Computer Science, Computer Science (R0)
