Synthetic Protein Sequence Oversampling Method for Classification and Remote Homology Detection in Imbalanced Protein Data

  • Conference paper
Bioinformatics Research and Development (BIRD 2007)

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 4414)

Abstract

Many classifiers are designed under the assumption of well-balanced datasets. In real problems such as protein classification and remote homology detection, however, binary classifiers such as support vector machines (SVMs) and other kernel methods face imbalanced data: the number of protein sequences available as positive data (the minor class) is small compared with the negative data (the major class). A widely used remedy in protein classification is to apply a different error cost or decision threshold to positive and negative data in order to control the sensitivity of the classifier. Our experiments show that when the datasets are highly imbalanced, and especially when the classes overlap, the efficiency and stability of that approach decrease. This paper shows that combining it with our proposed oversampling method for protein sequences increases both the sensitivity and the stability of the classifier. The Synthetic Protein Sequence Oversampling (SPSO) method creates synthetic protein sequences of the minor class, taking into account the distribution of that class as well as that of the major class, and it operates in data space rather than feature space. We used G-protein-coupled receptor families as real data and classified them at the subfamily and sub-subfamily levels, where few sequences are available, obtaining better accuracy and Matthews correlation coefficient than previously published methods. We also generated artificial data with different distributions and degrees of overlap between the minor and major classes to measure the efficiency of our method. The method was evaluated by the area under the receiver operating characteristic (ROC) curve.
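The two evaluation measures named in the abstract, the Matthews correlation coefficient and the area under the ROC curve, can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names, the +1/-1 label encoding, and the toy data are assumptions made for illustration only. The example shows why plain accuracy is misleading on imbalanced data, since a classifier that rejects every sequence reaches 95% accuracy on a 5:95 split while its Matthews correlation coefficient is zero.

```python
import math

def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for labels encoded as +1 (positive) and -1 (negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == -1 and p == -1)
    fp = sum(1 for t, p in pairs if t == -1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == -1)
    return tp, tn, fp, fn

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; defined as 0.0 when the denominator vanishes."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def roc_auc(y_true, scores):
    """Area under the ROC curve, computed as the fraction of (positive, negative)
    pairs in which the positive example gets the higher score (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == -1]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy imbalanced set (hypothetical): 5 positives, 95 negatives.
y_true = [1] * 5 + [-1] * 95
y_all_negative = [-1] * 100  # a classifier that never predicts the minor class

tp, tn, fp, fn = confusion_counts(y_true, y_all_negative)
accuracy = (tp + tn) / len(y_true)
print("accuracy:", accuracy)           # high despite finding no positives
print("MCC:", mcc(tp, tn, fp, fn))     # zero: no better than a trivial predictor

# Small ranking example for AUC (hypothetical scores).
print("AUC:", roc_auc([1, 1, -1, -1], [0.9, 0.4, 0.5, 0.1]))
```

The pairwise form of the AUC is quadratic in the number of examples, which is fine for small sets; a sort-based rank computation would be the usual choice for larger ones.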

Editor information

Sepp Hochreiter, Roland Wagner

Copyright information

© 2007 Springer Berlin Heidelberg

About this paper

Cite this paper

Beigi, M.M., Zell, A. (2007). Synthetic Protein Sequence Oversampling Method for Classification and Remote Homology Detection in Imbalanced Protein Data. In: Hochreiter, S., Wagner, R. (eds) Bioinformatics Research and Development. BIRD 2007. Lecture Notes in Computer Science, vol 4414. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-71233-6_21

  • DOI: https://doi.org/10.1007/978-3-540-71233-6_21

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-71232-9

  • Online ISBN: 978-3-540-71233-6

  • eBook Packages: Computer Science, Computer Science (R0)
