Abstract
A challenging problem in Bioinformatics is to predict the structure, properties, activities or interactions of proteins from their amino acid sequences. Sequence-derived physicochemical features of proteins have been used to support the development of Machine Learning (ML) models. However, tools and platforms that calculate features from protein sequences and train ML models are scarce and are limited in performance, user-friendliness and domains of application.
Here, a generic, modular, semi-automated platform for the classification of proteins based on their physicochemical properties using ML is proposed. The tool, developed as a Python package, facilitates the major tasks of ML and includes modules to read and alter sequences, calculate several types of protein descriptors, pre-process datasets, execute feature selection and dimensionality reduction, perform clustering, train and optimize ML models, and make predictions with eight different algorithms. ProPythia has an adaptable modular architecture, making it a versatile and easy-to-use tool for applying ML analysis to protein sequences. The platform was tested in the classification of membrane-active anticancer and antimicrobial peptides. The package, its source code and documentation, including a user guide and case studies, are freely available at https://github.com/BioSystemsUM/propythia. It can also be installed with ‘pip install propythia’.
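To make the idea of sequence-derived physicochemical descriptors concrete, the following is a minimal sketch in plain Python. It does not use or reproduce the ProPythia API: the function names `clean_sequence` and `describe`, the choice of descriptors, and the example peptide are assumptions made purely for illustration. The hydropathy values are those of the Kyte-Doolittle scale, and the net charge is a crude residue count rather than a pH-dependent calculation.

```python
from collections import Counter

# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy index for each standard residue.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def clean_sequence(seq):
    """Uppercase the sequence and drop characters that are not standard residues."""
    return "".join(ch for ch in seq.upper() if ch in KYTE_DOOLITTLE)


def describe(seq):
    """Return a dict of simple descriptors (hypothetical helper, not ProPythia's)."""
    seq = clean_sequence(seq)
    if not seq:
        raise ValueError("sequence contains no standard amino acids")
    counts = Counter(seq)
    n = len(seq)
    # Amino acid composition: fraction of each of the 20 residues.
    features = {f"frac_{aa}": counts[aa] / n for aa in AMINO_ACIDS}
    features["length"] = n
    # Mean hydropathy over the whole sequence.
    features["mean_hydropathy"] = sum(KYTE_DOOLITTLE[aa] for aa in seq) / n
    # Crude charge estimate: +1 per K/R, -1 per D/E (ignores pH, His and termini).
    features["approx_net_charge"] = (
        counts["K"] + counts["R"] - counts["D"] - counts["E"]
    )
    return features


# Arbitrary example peptide, used only to exercise the function.
example = describe("GIGKFLHSAKKFGKAFVGEIMNS")
print(example["length"], example["approx_net_charge"])
```

A table of such dictionaries, one row per sequence, is the kind of feature matrix that could then be pre-processed, reduced and passed to a classifier.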
Acknowledgments
This study was supported by FCT through project PTDC/CCI-BIO/28200/2017 and the strategic funding of UID/BIO/04469/2020, and also by the European Regional Development Fund under the scope of Norte2020, through the project DeepBio (ref. NORTE-01-0247-FEDER-039831). This work was also financially supported by Project LISBOA-01-0145-FEDER-007660 (Microbiologia Molecular, Estrutural e Celular) funded by FEDER funds through COMPETE2020 - Programa Operacional Competitividade e Internacionalização (POCI) and by national funds through FCT - Fundação para a Ciência e a Tecnologia.
Copyright information
© 2021 The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Sequeira, A.M., Lousa, D., Rocha, M. (2021). ProPythia: A Python Automated Platform for the Classification of Proteins Using Machine Learning. In: Panuccio, G., Rocha, M., Fdez-Riverola, F., Mohamad, M., Casado-Vara, R. (eds) Practical Applications of Computational Biology & Bioinformatics, 14th International Conference (PACBB 2020). PACBB 2020. Advances in Intelligent Systems and Computing, vol 1240. Springer, Cham. https://doi.org/10.1007/978-3-030-54568-0_4
DOI: https://doi.org/10.1007/978-3-030-54568-0_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-54567-3
Online ISBN: 978-3-030-54568-0
eBook Packages: Intelligent Technologies and Robotics (R0)