Moment Vector Encoding of Protein Sequences for Supervised Classification

Altartouri, Haneen; Glasmachers, Tobias

doi:10.1007/978-3-030-23873-5_4

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 1005)

Included in the following conference series:

International Conference on Practical Applications of Computational Biology & Bioinformatics

Abstract

Automated prediction of biological attributes of protein sequences with machine learning methods depends on a well-suited protein representation. A central challenge is to represent variable-length sequences as fixed-length feature vectors. In this paper we introduce a new approach that represents a protein sequence as a fixed-length vector by applying statistical moments directly to the values of physicochemical properties of its amino acids. On four benchmarks, this encoding achieves higher prediction accuracy than previous approaches that apply moments to complex descriptors extracted from the physicochemical properties, and it also outperforms the PseAAC encoding method. The best results are achieved by removing highly correlated features with principal component analysis.
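The encoding idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state which physicochemical properties or which moment orders are used, so the sketch assumes a single property (the Kyte-Doolittle hydropathy scale), the mean plus central moments of orders 2 to 4, and a plain SVD-based PCA for decorrelating the features. The function names `moment_vector`, `encode_dataset`, and `pca_reduce` are hypothetical.

```python
import numpy as np

# Assumption: Kyte-Doolittle hydropathy values stand in for the
# physicochemical properties; the paper may use different ones.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def moment_vector(sequence, scale, max_order=4):
    """Map a variable-length sequence to a fixed-length moment vector.

    Returns the mean followed by the central moments of orders
    2..max_order of the per-residue property values (an assumed choice).
    """
    values = np.array([scale[aa] for aa in sequence], dtype=float)
    mean = values.mean()
    central = [np.mean((values - mean) ** k) for k in range(2, max_order + 1)]
    return np.array([mean, *central])


def encode_dataset(sequences, scale=KYTE_DOOLITTLE, max_order=4):
    """Stack the moment vectors of all sequences into one feature matrix."""
    return np.vstack([moment_vector(s, scale, max_order) for s in sequences])


def pca_reduce(X, n_components):
    """Project centered features onto the leading principal components."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T


if __name__ == "__main__":
    # Toy sequences of different lengths; all yield 4 features each.
    seqs = ["MKTAYIAKQR", "GAVLIPFMW", "ACDEFGHIKLMNPQRSTVWY"]
    X = encode_dataset(seqs)
    print(X.shape)                 # one row per sequence, one column per moment
    print(pca_reduce(X, 2).shape)  # decorrelated, lower-dimensional features
```

With several property scales, the per-scale moment vectors could be concatenated before the PCA step; the resulting matrix can then be passed to any standard classifier.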

Author information

Authors and Affiliations

Institute for Neural Computation, Ruhr-University Bochum, Bochum, Germany
Haneen Altartouri & Tobias Glasmachers

Corresponding author

Correspondence to Haneen Altartouri.

Editor information

Editors and Affiliations

Edificio Politécnico, Escuela Superior de Ingeniería Informática, Campus Universitario As Lagoas, Ourense, Spain
Florentino Fdez-Riverola
Departamento de Informática, Universidade do Minho, Braga, Portugal
Miguel Rocha
Faculty of Bioengineering and Technology, Universiti Malaysia Kelantan, Kelantan, Malaysia
Mohd Saberi Mohamad
Department of Computer Science and Software Engineering Leader, Data Science Research Group, College of Information Technology (CIT) United Arab Emirates University (UAEU), Al Ain, United Arab Emirates
Nazar Zaki
IBSAL/BISITE Research Group, University of Salamanca, Salamanca, Salamanca, Spain
José A. Castellanos-Garzón

About this paper

Cite this paper

Altartouri, H., Glasmachers, T. (2020). Moment Vector Encoding of Protein Sequences for Supervised Classification. In: Fdez-Riverola, F., Rocha, M., Mohamad, M., Zaki, N., Castellanos-Garzón, J. (eds) Practical Applications of Computational Biology and Bioinformatics, 13th International Conference. PACBB 2019. Advances in Intelligent Systems and Computing, vol 1005. Springer, Cham. https://doi.org/10.1007/978-3-030-23873-5_4

DOI: https://doi.org/10.1007/978-3-030-23873-5_4
Published: 22 June 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-23872-8
Online ISBN: 978-3-030-23873-5
eBook Packages: Intelligent Technologies and Robotics, Intelligent Technologies and Robotics (R0)
