Abstract
Research on protein structure and function is one of the most important subjects in modern bioinformatics and computational biology. It often uses advanced data mining and machine learning methodologies to perform prediction or pattern recognition tasks. This paper describes a new method for predicting protein secondary structure content based on feature selection and multiple linear regression. The method develops a novel representation of primary protein sequences based on a large set of 495 features. Feature selection performed on a very large set of nearly 6,000 proteins, together with tests performed on standard non-homologous protein sets, confirms the high quality of the developed solution. Applying feature selection and the novel representation reduced the error rate by 14-15% compared with results achieved using the standard representation. The prediction tests also show that a small set of 5-25 features is sufficient to achieve accurate prediction of both helix and strand content for non-homologous proteins.
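The general pipeline named in the abstract (represent each sequence as a feature vector, select a small subset of features, then fit a multiple linear regression for content) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the features are only the 20 amino acid composition fractions rather than the paper's 495 features, the selection step ranks features by absolute Pearson correlation with the target (the abstract does not say which selection procedure was used), and the sequences and "helix content" values are synthetic.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition(sequence):
    """Fraction of each standard amino acid in the sequence (20 features)."""
    counts = np.array([sequence.count(a) for a in AMINO_ACIDS], dtype=float)
    return counts / max(len(sequence), 1)


def select_features(X, y, k):
    """Indices of the k columns with the largest |Pearson correlation| with y.

    Assumption: a simple filter-style ranking used as a stand-in for the
    paper's feature selection procedure, which the abstract does not describe.
    """
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    corr = np.divide(Xc.T @ yc, denom,
                     out=np.zeros(X.shape[1]), where=denom > 0)
    return np.argsort(-np.abs(corr))[:k]


def fit_linear(X, y):
    """Ordinary least squares with an intercept; returns [b0, b1, ..., bk]."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict(coef, X):
    """Apply the fitted linear model to a feature matrix."""
    return coef[0] + X @ coef[1:]


# Synthetic demonstration data (hypothetical, not from the paper).
rng = np.random.default_rng(0)
seqs = ["".join(rng.choice(list(AMINO_ACIDS), size=200)) for _ in range(300)]
X = np.array([composition(s) for s in seqs])
idx_a, idx_l = AMINO_ACIDS.index("A"), AMINO_ACIDS.index("L")
# Made-up target: content depends on the A and L fractions plus small noise.
y = 0.2 + 1.5 * X[:, idx_a] + 1.0 * X[:, idx_l] + rng.normal(0, 0.005, len(X))

train, test = slice(0, 200), slice(200, 300)
selected = select_features(X[train], y[train], k=5)
coef = fit_linear(X[train][:, selected], y[train])
mae = np.abs(predict(coef, X[test][:, selected]) - y[test]).mean()
print("selected residues:", [AMINO_ACIDS[i] for i in selected])
print("mean absolute error on held-out sequences:", round(float(mae), 4))
```

In this sketch the selection is done on the training split only, so the held-out error is not influenced by the choice of features. A separate model of the same form would be fitted for strand content.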
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Kurgan, L., Homaeian, L. (2005). Prediction of Secondary Protein Structure Content from Primary Sequence Alone – A Feature Selection Based Approach. In: Perner, P., Imiya, A. (eds) Machine Learning and Data Mining in Pattern Recognition. MLDM 2005. Lecture Notes in Computer Science, vol 3587. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11510888_33
DOI: https://doi.org/10.1007/11510888_33
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-26923-6
Online ISBN: 978-3-540-31891-0
eBook Packages: Computer Science, Computer Science (R0)