Abstract
Automated prediction of the biological attributes of protein sequences with machine learning methods depends on a well-suited protein representation. A central challenge is to represent variable-length sequences as fixed-length feature vectors. In this paper we introduce a new approach that represents a protein sequence as a fixed-length vector by applying statistical moments directly to the physicochemical property values of its amino acids. On four benchmarks, this encoding gives higher prediction accuracy than previous approaches that applied moments of complex descriptors extracted from the physicochemical properties, and it also outperforms the PseAAC (pseudo-amino-acid composition) encoding method. The best results are achieved when highly correlated features are removed with principal component analysis.
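The core idea of the abstract, mapping each residue to a physicochemical value and summarizing the resulting variable-length series with statistical moments, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the two property scales (the Kyte-Doolittle hydropathy index and a simplified side-chain charge), the choice of the mean plus central moments of order 2 to 4, and the names `moment_vector`, `PROPERTIES`, and `max_order` are all hypothetical choices made for this example.

```python
import numpy as np

# Kyte-Doolittle hydropathy index, one widely used physicochemical scale.
# Which properties the paper uses is not stated in the abstract (assumption).
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Simplified side-chain charge at neutral pH (illustrative assumption).
CHARGE = {aa: 0.0 for aa in HYDROPATHY}
CHARGE.update({"K": 1.0, "R": 1.0, "D": -1.0, "E": -1.0})

PROPERTIES = [HYDROPATHY, CHARGE]


def moment_vector(sequence, properties=PROPERTIES, max_order=4):
    """Encode a sequence as the mean plus central moments 2..max_order
    of each property series, giving a vector of fixed length
    len(properties) * max_order regardless of sequence length."""
    features = []
    for scale in properties:
        # Replace every residue by its property value.
        values = np.array([scale[aa] for aa in sequence], dtype=float)
        mean = values.mean()
        features.append(mean)
        centered = values - mean
        for k in range(2, max_order + 1):
            # k-th central moment of the property series.
            features.append(np.mean(centered ** k))
    return np.array(features)


# Sequences of different lengths map to vectors of the same length.
short_vec = moment_vector("ACDK")
long_vec = moment_vector("MKTAYIAKQRQISFVKSHFSRQ")
print(short_vec.shape, long_vec.shape)
```

Vectors produced this way can be stacked into a feature matrix and passed to any standard classifier. Moments of different orders computed from the same series tend to be correlated, which is consistent with the abstract's remark that a principal component analysis step improves the results.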
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Altartouri, H., Glasmachers, T. (2020). Moment Vector Encoding of Protein Sequences for Supervised Classification. In: Fdez-Riverola, F., Rocha, M., Mohamad, M., Zaki, N., Castellanos-Garzón, J. (eds) Practical Applications of Computational Biology and Bioinformatics, 13th International Conference. PACBB 2019. Advances in Intelligent Systems and Computing, vol 1005. Springer, Cham. https://doi.org/10.1007/978-3-030-23873-5_4
DOI: https://doi.org/10.1007/978-3-030-23873-5_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-23872-8
Online ISBN: 978-3-030-23873-5
eBook Packages: Intelligent Technologies and Robotics (R0)