Synthetic Protein Sequence Oversampling Method for Classification and Remote Homology Detection in Imbalanced Protein Data

Beigi, Majid M.; Zell, Andreas

doi:10.1007/978-3-540-71233-6_21

Majid M. Beigi¹ &
Andreas Zell¹

Part of the book series: Lecture Notes in Computer Science ((LNBI,volume 4414))

Included in the following conference series:

International Conference on Bioinformatics Research and Development

1159 Accesses
2 Citations

Abstract

Many classifiers are designed with the assumption of well-balanced datasets. But in real problems, like protein classification and remote homology detection, when using binary classifiers like support vector machine (SVM) and kernel methods, we are facing imbalanced data in which we have a low number of protein sequences as positive data (minor class) compared with negative data (major class). A widely used solution to that issue in protein classification is using a different error cost or decision threshold for positive and negative data to control the sensitivity of the classifiers. Our experiments show that when the datasets are highly imbalanced, and especially with overlapped datasets, the efficiency and stability of that method decreases. This paper shows that a combination of the above method and our suggested oversampling method for protein sequences can increase the sensitivity and also stability of the classifier. Synthetic Protein Sequence Oversampling (SPSO) method involves creating synthetic protein sequences of the minor class, considering the distribution of that class and also of the major class, and it operates in data space instead of feature space. We used G-protein-coupled receptors families as real data to classify them at subfamily and sub-subfamily levels (having low number of sequences) and could get better accuracy and Matthew’s correlation coefficient than other previously published method. We also made artificial data with different distributions and overlappings of minor and major classes to measure the efficiency of our method. The method was evaluated by the area under the Receiver Operating Curve (ROC).

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 84.99; Price excludes VAT (USA)

Softcover Book: USD 109.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

References

Leslie, C., Eskin, E., Cohen, A., Weston, J., Noble, W.S.: Mismatch string kernel for svm protein classification. In: Advances in Neural Information Processing System, pp. 1441–1448 (2003)
Google Scholar
Al-Shahib, A., Breitling, R., Gilbert, D.: Feature selection and the class imbalance problem in predicting protein function from sequence. Appl. Bioinformatics 4(3), 195–203 (2005)
Google Scholar
Pazzini, M., Marz, C., Murphi, P., Ali, K., Hume, T., Bruk, C.: Reducing misclassification costs. In: Proceedings of the Eleventh Int. Conf. on Machine Learning, pp. 217–225 (1994)
Google Scholar
Japkowicz, N., Myers, C., Gluch, M.: A novelty detection approach to classification. In: Proceeding of the Fourteenth Int. Joint Conf. on Artificial Inteligence, pp. 10–15 (1995)
Google Scholar
Japkowicz, N.: Learning from imbalanved data sets: A comparison of various strategies. In: Proceedings of Learning from Imbalanced Data, pp. 10–15 (2000)
Google Scholar
Veropoulos, K., Campbell, C., Cristianini, N.: Controlling the sensitivity of support vector machines. In: Proceedings of the International Joint Conference on AI, pp. 55–60 (1999)
Google Scholar
Wu, G., Chang, E.: Class-boundary alignment for imbalanced dataset learning. In: ICML 2003 Workshop on Learning from Imbalanced Data Sets II,Washington, DC (2003)
Google Scholar
Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: Smote: Synthetic minority over-sampling technique. Journal of Artificial Intelligence and Research 16, 321–357 (2002)
MATH Google Scholar
Leslie, C., Eskin, E., Noble, W.S.: The spectrum kernel: A string kernel for svm protein classification. In: Proceedings of the Pacific Symposium on Biocomputing, pp. 564–575 (2002)
Google Scholar
Saigo, H., Vert, J.P., Ueda, N., Akustu, T.: Protein homology detection using string alignment kernels. Bioinformatics 20(11), 1682–1689 (2004)
Article Google Scholar
Thompson, J.D., Higgins, D.G., Gibson, T.J.: Clustalw: improving the sesitivity of progressive multiple sequence alignment through sequence weighting, positions-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22, 4673–4680 (1994)
Article Google Scholar
Attwood, T.K, Croning, M.D.R., Gaulton, A.: Deriving structural and functional insights from a ligand-based hierarchical classification of g-protein coupled receptors. Protein Eng. 15, 7–12 (2002)
Article Google Scholar
Horn, F., Bettler, E., Oliveira, L., Campagne, F., Cohhen, F.E., Vriend, G.: Gpcrdb information system for g protein-coupled receptors. Nucleic Acids Res. 31(1), 294–297 (2003)
Article Google Scholar
Bairoch, A., Apweiler, R.: The swiss-prot protein sequence data bank and its supplement trembl. Nucleic Acids Res. 29, 346–349 (2001)
Article Google Scholar
Vert, J.-P., Saigo, H., Akustu, T.: Convolution and local alignment kernel. In: Schoelkopf, B., Tsuda, K., Vert, J.-P. (eds.) Kernel Methods in Compuatational Biology, MIT Press, Cambridge
Google Scholar
Joachims, T.: Macking large scale svm learning practical. Technical Report LS8-24, Universitat Dortmond (1998)
Google Scholar
Provost, F., Fawcett, T.: Robust classification for imprecise environments. Machine Learning 423, 203–231 (2001)
Article Google Scholar
Swet, J.: Measuring the accuracy of diagnostic systems. Science 240, 1285–1293 (1988)
Article MathSciNet Google Scholar
Bhasin, M., Raghava, G.P.S.: Gpcrpred: an svm-based method for prediction of families and subfamilies of g-protein coupled receptors. Nucleaic Acids res. 32, 383–389 (2004)
Article Google Scholar
Karchin, R., Karplus, K., Haussler, D.: Classifying g-protein coupled receptors with support vector machines. Bioinformatics 18(1), 147–159 (2002)
Article Google Scholar
Huang, Y., Cai, J., Li, Y.D.: Classifying g-protein coupled receptors with bagging classification tree. Computationa Biology and Chemistry 28, 275–280 (2004)
Article MATH Google Scholar

Download references

Author information

Authors and Affiliations

University of Tübingen, Center for Bioinformatics Tübingen (ZBIT), Sand 1, D-72076 Tübingen, Germany
Majid M. Beigi & Andreas Zell

Authors

Majid M. Beigi
View author publications
You can also search for this author in PubMed Google Scholar
Andreas Zell
View author publications
You can also search for this author in PubMed Google Scholar

Editor information

Sepp Hochreiter Roland Wagner

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Beigi, M.M., Zell, A. (2007). Synthetic Protein Sequence Oversampling Method for Classification and Remote Homology Detection in Imbalanced Protein Data. In: Hochreiter, S., Wagner, R. (eds) Bioinformatics Research and Development. BIRD 2007. Lecture Notes in Computer Science(), vol 4414. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-71233-6_21

Download citation

DOI: https://doi.org/10.1007/978-3-540-71233-6_21
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-71232-9
Online ISBN: 978-3-540-71233-6
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics