The impact of pre-clustering on classification of heterogeneous protein data

  • Original Article
Network Modeling Analysis in Health Informatics and Bioinformatics

Abstract

This paper evaluates whether introducing clustering as a preprocessing step improves the classification of protein sequence data. Cluster analysis is used to discover possible sub-clusters that have different patterns within the same protein class. A classification learning algorithm is then applied to each cluster separately to improve classification accuracy. Two standard benchmark datasets were used to examine the proposed approach: caspase 3 human substrates, which include cleaved and non-cleaved peptides, and the membrane proteins inner and \(\alpha\)-helical proteins. Different descriptors based on the physicochemical properties of amino acids were extracted from the protein sequence data, and two encoding methods were used to represent the protein sequences using these descriptors. The results show that applying a clustering step prior to classification gives higher prediction accuracy than classification alone. In addition, timing results show that the proposed approach significantly reduces the training time of the classification process while maintaining prediction accuracy.
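
The cluster-then-classify idea described in the abstract can be illustrated with a minimal sketch. The abstract does not name the clustering or classification algorithms, so plain k-means and a nearest-class-mean classifier are used here purely as stand-ins. All function names, the two-dimensional feature vectors, and the toy labels are hypothetical and are not data or code from the paper.

```python
from collections import defaultdict

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    return tuple(sum(col) / len(points) for col in zip(*points))

def nearest(x, centroids):
    """Index of the centroid closest to x."""
    return min(range(len(centroids)), key=lambda i: sq_dist(x, centroids[i]))

def kmeans(points, k, iters=100):
    """Plain k-means with deterministic farthest-first initialisation."""
    centroids = [tuple(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(tuple(far))
    for _ in range(iters):
        groups = defaultdict(list)
        for p in points:
            groups[nearest(p, centroids)].append(p)
        updated = [mean(groups[i]) if groups[i] else centroids[i] for i in range(k)]
        if updated == centroids:  # assignments are stable, stop early
            break
        centroids = updated
    return centroids

class ClassMeanClassifier:
    """Stand-in classifier: predicts the label whose class mean is closest."""
    def fit(self, X, y):
        by_label = defaultdict(list)
        for x, label in zip(X, y):
            by_label[label].append(x)
        self.means = {label: mean(pts) for label, pts in by_label.items()}
        return self

    def predict(self, x):
        return min(self.means, key=lambda label: sq_dist(x, self.means[label]))

def fit_preclustered(X, y, k):
    """Cluster the training vectors, then fit one classifier per cluster."""
    centroids = kmeans(X, k)
    members = defaultdict(lambda: ([], []))
    for x, label in zip(X, y):
        xs, ys = members[nearest(x, centroids)]
        xs.append(x)
        ys.append(label)
    models = {c: ClassMeanClassifier().fit(xs, ys) for c, (xs, ys) in members.items()}
    return centroids, models

def predict_preclustered(x, centroids, models):
    """Route x to its nearest non-empty cluster and use that cluster's model."""
    c = min(models, key=lambda i: sq_dist(x, centroids[i]))
    return models[c].predict(x)

# Synthetic toy vectors (assumption): two sub-populations in which the same
# class shows opposite patterns, so one global class mean cannot separate them.
X = [(0, 1), (1, 1), (0, -1), (1, -1), (10, -1), (11, -1), (10, 1), (11, 1)]
y = ["cleaved", "cleaved", "non-cleaved", "non-cleaved",
     "cleaved", "cleaved", "non-cleaved", "non-cleaved"]

centroids, models = fit_preclustered(X, y, k=2)
print(predict_preclustered((0.5, 0.8), centroids, models))
print(predict_preclustered((10.5, 0.8), centroids, models))
```

In this toy layout a single classifier fitted on all points sees identical class means and cannot distinguish the classes, whereas the per-cluster models can. Each per-cluster model is also fitted on a fraction of the data, which is one plausible reason training time can drop, although the sketch does not measure it.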

Author information

Corresponding author

Correspondence to Hashem Tamimi.

About this article

Cite this article

Altartouri, H., Tamimi, H. & Ashhab, Y. The impact of pre-clustering on classification of heterogeneous protein data. Netw Model Anal Health Inform Bioinforma 11, 3 (2022). https://doi.org/10.1007/s13721-021-00336-0

  • DOI: https://doi.org/10.1007/s13721-021-00336-0
