Abstract
The One-Class Classification (OCC) approach assumes that, in the training phase, samples are available only from a target class. OCC methods have been applied successfully to problems in which the classes differ greatly in size. Because class imbalance is typical of protein classification tasks, we tested one-class classification algorithms for the detection of distant similarities in protein sequences and structures. We found that the OCC approach yielded a small improvement in classification performance over binary classifiers (SVM, ANN, Random Forest). More importantly, training time was reduced substantially (50- to 100-fold). OCCs may provide an especially useful alternative for processing those protein groups where discriminative classifiers cannot be easily trained.
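To make the one-class setting concrete, the following is a minimal sketch of a classifier trained on target-class samples only. It is not one of the methods evaluated in the paper: the centroid-and-radius model, the function names `fit_one_class` and `is_target`, the `reject_fraction` parameter, and the toy two-dimensional data are all assumptions chosen for illustration.

```python
import math


def fit_one_class(targets, reject_fraction=0.1):
    """Fit a centroid-and-radius model from target-class samples only.

    Hypothetical illustration: no counter-examples are used in training.
    """
    dim = len(targets[0])
    centroid = [sum(x[i] for x in targets) / len(targets) for i in range(dim)]
    # Distances of the training targets to the centroid, in ascending order.
    dists = sorted(math.dist(x, centroid) for x in targets)
    # Pick the radius so that roughly `reject_fraction` of the
    # training targets fall outside the accepted region.
    keep = max(1, math.ceil(len(dists) * (1.0 - reject_fraction)))
    radius = dists[keep - 1]
    return centroid, radius


def is_target(model, x):
    """Accept x as a target-class member if it lies within the radius."""
    centroid, radius = model
    return math.dist(x, centroid) <= radius


# Toy target-class samples (assumed data, not from the paper).
targets = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
model = fit_one_class(targets, reject_fraction=0.0)

print(is_target(model, (0.5, 0.6)))  # a point near the targets
print(is_target(model, (5.0, 5.0)))  # a point far from the targets
```

Because only the target class is modelled, training cost in a sketch like this depends on the number of target samples alone, which matches the intuition behind the shorter training times reported above.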
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Bánhalmi, A., Busa-Fekete, R., Kégl, B. (2009). A One-Class Classification Approach for Protein Sequences and Structures. In: Măndoiu, I., Narasimhan, G., Zhang, Y. (eds) Bioinformatics Research and Applications. ISBRA 2009. Lecture Notes in Computer Science, vol 5542. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-01551-9_30
DOI: https://doi.org/10.1007/978-3-642-01551-9_30
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-01550-2
Online ISBN: 978-3-642-01551-9
eBook Packages: Computer Science (R0)