Abstract
With the explosive growth in the number of protein sequences generated in the postgenomic age, identifying cytokines among proteins and elucidating their biochemical mechanisms have become increasingly important. Unfortunately, identifying cytokines is challenging for two reasons: the structure space of proteins is poorly understood, and cytokines make up only a small fraction of the massive number of known proteins. Because a protein sequence is conceptually similar to a mapping of words to meaning, we explore n-grams, a type of probabilistic language model, to extract features from proteins. To address the second challenge, the imbalance between cytokines and other proteins, we use a genetic algorithm, a search heuristic that mimics the process of natural selection, to develop a classifier that generates precise predictions of cytokines among proteins. Experiments on an imbalanced protein data set show that our methods outperform traditional algorithms in prediction ability.
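As an illustration of the n-gram idea mentioned in the abstract, the following is a minimal sketch of how a protein sequence might be turned into a fixed-length feature vector of n-gram frequencies. It is not the authors' implementation: the function name `ngram_features`, the restriction to the 20 standard amino acids, the default n = 2, and the normalization by total count are all assumptions made for this example.

```python
from collections import Counter
from itertools import product

# Assumption: only the 20 standard amino-acid letters form the vocabulary.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def ngram_features(sequence, n=2):
    """Return normalized n-gram frequencies as a fixed-length list.

    The vector has 20**n entries, one per possible n-gram, in the
    lexicographic order induced by AMINO_ACIDS. N-grams containing
    non-standard letters (for example 'X') are ignored.
    """
    vocab = ["".join(p) for p in product(AMINO_ACIDS, repeat=n)]
    # Count every overlapping window of length n in the sequence.
    counts = Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))
    # Normalize only over n-grams that belong to the vocabulary.
    total = sum(counts[g] for g in vocab)
    return [counts[g] / total if total else 0.0 for g in vocab]


# Hypothetical usage with an arbitrary short sequence.
features = ngram_features("MKTAYIAKQR", n=2)
print(len(features))  # 400 possible 2-grams
```

Vectors of this kind have the same length for every protein regardless of sequence length, so they can be passed to any standard classifier, including one tuned by a genetic algorithm as the abstract describes.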
Author information
Xiangxiang Zeng received his BS degree in automation from Hunan University, China in 2005, and his PhD in systems engineering from Huazhong University of Science and Technology, China in 2011. From 2010 to 2011 he spent one year working in the natural computing group at Seville University, Spain. He is currently an assistant professor in the Department of Computer Science, Xiamen University, China. His main research interests include membrane computing, neural computing, and automaton theory.
Sisi Yuan is a master's student in the Department of Computer Science at Xiamen University, China. She received her BS degree in software engineering from Hangzhou Dianzi University, China. Her research interests include data mining and bioinformatics.
Xianxian Huang is an undergraduate student of the Department of Computer Science at Xiamen University, China. His main research interests are data mining and bioinformatics.
Quan Zou is an associate professor of computer science at Xiamen University, China. He received his PhD degree from Harbin Institute of Technology, China in 2009. His research is in the areas of bioinformatics, machine learning, and parallel computing. His current focus is on genome assembly, annotation, and functional analysis of next-generation sequencing data with parallel computing methods. Several related works have been published in Briefings in Bioinformatics, Bioinformatics, PLOS ONE, and IEEE/ACM Transactions on Computational Biology and Bioinformatics. He serves on many impactful journals and the National Natural Science Foundation of China.
Cite this article
Zeng, X., Yuan, S., Huang, X. et al. Identification of cytokine via an improved genetic algorithm. Front. Comput. Sci. 9, 643–651 (2015). https://doi.org/10.1007/s11704-014-4089-3