Abstract
Many classifiers are designed with the assumption of well-balanced datasets. But in real problems, like protein classification and remote homology detection, when using binary classifiers like support vector machine (SVM) and kernel methods, we are facing imbalanced data in which we have a low number of protein sequences as positive data (minor class) compared with negative data (major class). A widely used solution to that issue in protein classification is using a different error cost or decision threshold for positive and negative data to control the sensitivity of the classifiers. Our experiments show that when the datasets are highly imbalanced, and especially with overlapped datasets, the efficiency and stability of that method decreases. This paper shows that a combination of the above method and our suggested oversampling method for protein sequences can increase the sensitivity and also stability of the classifier. Synthetic Protein Sequence Oversampling (SPSO) method involves creating synthetic protein sequences of the minor class, considering the distribution of that class and also of the major class, and it operates in data space instead of feature space. We used G-protein-coupled receptors families as real data to classify them at subfamily and sub-subfamily levels (having low number of sequences) and could get better accuracy and Matthew’s correlation coefficient than other previously published method. We also made artificial data with different distributions and overlappings of minor and major classes to measure the efficiency of our method. The method was evaluated by the area under the Receiver Operating Curve (ROC).
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
References
Leslie, C., Eskin, E., Cohen, A., Weston, J., Noble, W.S.: Mismatch string kernel for svm protein classification. In: Advances in Neural Information Processing System, pp. 1441–1448 (2003)
Al-Shahib, A., Breitling, R., Gilbert, D.: Feature selection and the class imbalance problem in predicting protein function from sequence. Appl. Bioinformatics 4(3), 195–203 (2005)
Pazzini, M., Marz, C., Murphi, P., Ali, K., Hume, T., Bruk, C.: Reducing misclassification costs. In: Proceedings of the Eleventh Int. Conf. on Machine Learning, pp. 217–225 (1994)
Japkowicz, N., Myers, C., Gluch, M.: A novelty detection approach to classification. In: Proceeding of the Fourteenth Int. Joint Conf. on Artificial Inteligence, pp. 10–15 (1995)
Japkowicz, N.: Learning from imbalanved data sets: A comparison of various strategies. In: Proceedings of Learning from Imbalanced Data, pp. 10–15 (2000)
Veropoulos, K., Campbell, C., Cristianini, N.: Controlling the sensitivity of support vector machines. In: Proceedings of the International Joint Conference on AI, pp. 55–60 (1999)
Wu, G., Chang, E.: Class-boundary alignment for imbalanced dataset learning. In: ICML 2003 Workshop on Learning from Imbalanced Data Sets II,Washington, DC (2003)
Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: Smote: Synthetic minority over-sampling technique. Journal of Artificial Intelligence and Research 16, 321–357 (2002)
Leslie, C., Eskin, E., Noble, W.S.: The spectrum kernel: A string kernel for svm protein classification. In: Proceedings of the Pacific Symposium on Biocomputing, pp. 564–575 (2002)
Saigo, H., Vert, J.P., Ueda, N., Akustu, T.: Protein homology detection using string alignment kernels. Bioinformatics 20(11), 1682–1689 (2004)
Thompson, J.D., Higgins, D.G., Gibson, T.J.: Clustalw: improving the sesitivity of progressive multiple sequence alignment through sequence weighting, positions-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22, 4673–4680 (1994)
Attwood, T.K, Croning, M.D.R., Gaulton, A.: Deriving structural and functional insights from a ligand-based hierarchical classification of g-protein coupled receptors. Protein Eng. 15, 7–12 (2002)
Horn, F., Bettler, E., Oliveira, L., Campagne, F., Cohhen, F.E., Vriend, G.: Gpcrdb information system for g protein-coupled receptors. Nucleic Acids Res. 31(1), 294–297 (2003)
Bairoch, A., Apweiler, R.: The swiss-prot protein sequence data bank and its supplement trembl. Nucleic Acids Res. 29, 346–349 (2001)
Vert, J.-P., Saigo, H., Akustu, T.: Convolution and local alignment kernel. In: Schoelkopf, B., Tsuda, K., Vert, J.-P. (eds.) Kernel Methods in Compuatational Biology, MIT Press, Cambridge
Joachims, T.: Macking large scale svm learning practical. Technical Report LS8-24, Universitat Dortmond (1998)
Provost, F., Fawcett, T.: Robust classification for imprecise environments. Machine Learning 423, 203–231 (2001)
Swet, J.: Measuring the accuracy of diagnostic systems. Science 240, 1285–1293 (1988)
Bhasin, M., Raghava, G.P.S.: Gpcrpred: an svm-based method for prediction of families and subfamilies of g-protein coupled receptors. Nucleaic Acids res. 32, 383–389 (2004)
Karchin, R., Karplus, K., Haussler, D.: Classifying g-protein coupled receptors with support vector machines. Bioinformatics 18(1), 147–159 (2002)
Huang, Y., Cai, J., Li, Y.D.: Classifying g-protein coupled receptors with bagging classification tree. Computationa Biology and Chemistry 28, 275–280 (2004)
Author information
Authors and Affiliations
Editor information
Rights and permissions
Copyright information
© 2007 Springer Berlin Heidelberg
About this paper
Cite this paper
Beigi, M.M., Zell, A. (2007). Synthetic Protein Sequence Oversampling Method for Classification and Remote Homology Detection in Imbalanced Protein Data. In: Hochreiter, S., Wagner, R. (eds) Bioinformatics Research and Development. BIRD 2007. Lecture Notes in Computer Science(), vol 4414. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-71233-6_21
Download citation
DOI: https://doi.org/10.1007/978-3-540-71233-6_21
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-71232-9
Online ISBN: 978-3-540-71233-6
eBook Packages: Computer ScienceComputer Science (R0)