The impact of pre-clustering on classification of heterogeneous protein data

Altartouri, Haneen; Tamimi, Hashem; Ashhab, Yaqoub

doi:10.1007/s13721-021-00336-0

Original Article
Published: 07 December 2021

Volume 11, article number 3, (2022)

Network Modeling Analysis in Health Informatics and Bioinformatics

Abstract

The aim of this paper is to evaluate the improvement in the classification of protein sequence data obtained by introducing clustering as a preprocessing step. Clustering analysis is used to discover possible sub-clusters that might have different patterns within the same protein class. A classification learning algorithm is then applied to each cluster to enhance the classification accuracy. Two standard benchmark datasets were used to examine the proposed approach: caspase 3 human substrates, which include cleaved and non-cleaved peptides, and the membrane proteins inner and \(\alpha\)-helical proteins. Different descriptors based on the physicochemical properties of amino acids were extracted from the protein sequence data, and two encoding methods were used to represent the protein sequences using these descriptors. The results show that applying a clustering process prior to classification gives higher prediction accuracy than using classification alone. In addition, the timing results show that the proposed approach significantly reduced the training time of the classification process while maintaining prediction accuracy.
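
The cluster-then-classify idea can be illustrated with a minimal sketch. The abstract does not name the clustering or classification algorithms, so plain k-means and a nearest-centroid classifier are used here purely as stand-ins, and the toy 2-D points stand in for encoded protein descriptors. All names (`PreClusteredClassifier`, `NearestCentroid`, `kmeans`) are hypothetical and not taken from the paper.

```python
from collections import defaultdict


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of tuples."""
    return tuple(sum(col) / len(points) for col in zip(*points))


def nearest(point, centers):
    """Index of the center closest to point (first index wins ties)."""
    return min(range(len(centers)), key=lambda i: sq_dist(point, centers[i]))


def kmeans(points, k, max_iter=100):
    """Plain k-means; farthest-point initialisation is an assumption
    made here only to keep the example deterministic."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    for _ in range(max_iter):
        groups = defaultdict(list)
        for p in points:
            groups[nearest(p, centers)].append(p)
        updated = [mean(groups[i]) if groups[i] else centers[i] for i in range(k)]
        if updated == centers:
            break
        centers = updated
    return centers


class NearestCentroid:
    """Stand-in classifier: predict the label whose class centroid is closest."""

    def fit(self, X, y):
        by_label = defaultdict(list)
        for x, label in zip(X, y):
            by_label[label].append(x)
        self.centroids = {label: mean(pts) for label, pts in by_label.items()}
        return self

    def predict(self, x):
        return min(self.centroids, key=lambda lab: sq_dist(x, self.centroids[lab]))


class PreClusteredClassifier:
    """Cluster the training set first, then train one classifier per cluster."""

    def __init__(self, k):
        self.k = k

    def fit(self, X, y):
        self.centers = kmeans(X, self.k)
        buckets = defaultdict(lambda: ([], []))
        for x, label in zip(X, y):
            xs, ys = buckets[nearest(x, self.centers)]
            xs.append(x)
            ys.append(label)
        self.models = {c: NearestCentroid().fit(xs, ys) for c, (xs, ys) in buckets.items()}
        return self

    def predict(self, x):
        # Route the sample to the closest cluster that has a trained model.
        c = min(self.models, key=lambda i: sq_dist(x, self.centers[i]))
        return self.models[c].predict(x)


# Toy data: each class is made of two distant sub-clusters (XOR layout).
X_train = [(0, 0), (1, 0), (0, 1), (10, 10), (9, 10), (10, 9),
           (0, 10), (1, 10), (0, 9), (10, 0), (9, 0), (10, 1)]
y_train = ["pos"] * 6 + ["neg"] * 6
X_test = [(0.5, 0.5), (9.5, 9.5), (0.5, 9.5), (9.5, 0.5)]

flat = NearestCentroid().fit(X_train, y_train)
clustered = PreClusteredClassifier(k=4).fit(X_train, y_train)
print("flat:     ", [flat.predict(x) for x in X_test])
print("clustered:", [clustered.predict(x) for x in X_test])
```

In this toy layout both class centroids coincide at (5, 5), so the single flat classifier cannot separate the classes, whereas the per-cluster models can. Each per-cluster model also sees only a fraction of the training set, which is one plausible reason a pre-clustering step could reduce training time for costlier classifiers.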

Author information

Authors and Affiliations

College of Computer Information Technology and Computer Engineering, Palestine Polytechnic University, Hebron, Palestine
Haneen Altartouri & Hashem Tamimi
Palestine-Korea Biotechnology Center, Palestine Polytechnic University, Hebron, Palestine
Yaqoub Ashhab

Corresponding author

Correspondence to Hashem Tamimi.

About this article

Cite this article

Altartouri, H., Tamimi, H. & Ashhab, Y. The impact of pre-clustering on classification of heterogeneous protein data. Netw Model Anal Health Inform Bioinforma 11, 3 (2022). https://doi.org/10.1007/s13721-021-00336-0

Received: 01 April 2021
Revised: 09 August 2021
Accepted: 14 September 2021
Published: 07 December 2021
DOI: https://doi.org/10.1007/s13721-021-00336-0
