Apache Spark-based scalable feature extraction approaches for protein sequence and their clustering performance analysis

Jha, Preeti; Tiwari, Aruna; Bharill, Neha; Ratnaparkhe, Milind; Patel, Om Prakash; Harshith, Nilagiri; Mounika, Mukkamalla; Nagendra, Neha

doi:10.1007/s41060-022-00381-6

Regular Paper
Published: 03 January 2023

Volume 15, pages 359–378, (2023)
International Journal of Data Science and Analytics

Preeti Jha¹,
Aruna Tiwari¹,
Neha Bharill²,
Milind Ratnaparkhe³,
Om Prakash Patel²,
Nilagiri Harshith²,
Mukkamalla Mounika¹ &
Neha Nagendra¹

Abstract

Genome sequencing projects are rapidly contributing to the rise of high-dimensional protein sequence datasets. Extracting features from a high-dimensional protein sequence dataset poses many challenges. Although many feature extraction methods exist, applying them to millions of protein sequences becomes impractical because these approaches are not scalable. Therefore, to extract significant features efficiently and at scale, we propose two Apache Spark-based scalable feature extraction approaches that extract significantly important features based on statistical properties from huge protein sequence datasets, termed 60d-SPF (60-dimensional Scalable Protein Feature) and 6d-SCPSF (6-dimensional Scalable Co-occurrence-based Probability-Specific Feature). The proposed 60d-SPF and 6d-SCPSF approaches capture the statistical properties of amino acids to create a fixed-length numeric feature vector that represents each protein sequence in terms of 60-dimensional and 6-dimensional features, respectively. The preprocessed protein sequences are used as input to four clustering algorithms: scalable random sampling with iterative optimization fuzzy c-means (SRSIO-FCM), scalable literal fuzzy c-means (SLFCM), kernelized SRSIO-FCM (KSRSIO-FCM), and kernelized SLFCM (KSLFCM). We have conducted extensive experiments on various soybean protein datasets to compare the proposed feature extraction methods, 60d-SPF and 6d-SCPSF, with existing feature extraction methods on the SRSIO-FCM, SLFCM, KSRSIO-FCM, and KSLFCM clustering algorithms. The results, reported in terms of the Silhouette index and the Davies–Bouldin index, show that the proposed 60d-SPF method achieves significantly better results on all four clustering algorithms than the proposed 6d-SCPSF and the existing feature extraction approaches.
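
The abstract does not define the 60d-SPF or 6d-SCPSF features, so the following is only a minimal sketch of the general idea it describes: mapping a variable-length protein sequence to a fixed-length numeric vector built from amino-acid statistics. It assumes plain amino-acid composition (20 relative frequencies) as a stand-in feature, and the names `composition_features` and `extract_all` are hypothetical. It is not the authors' method.

```python
from collections import Counter
from typing import Iterable, List

# The 20 standard amino acids in one-letter code.
# A fixed order gives every sequence the same vector layout.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_features(sequence: str) -> List[float]:
    """Return a 20-dimensional vector of relative amino-acid frequencies.

    Illustrative stand-in only: this is NOT the 60d-SPF or 6d-SCPSF
    definition from the paper, which the abstract does not give.
    """
    # Count only the standard residues; other symbols (e.g. 'X') are skipped.
    counts = Counter(r for r in sequence.upper() if r in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


def extract_all(sequences: Iterable[str]) -> List[List[float]]:
    # Each sequence is processed independently of the others, which is the
    # property that lets a per-record map like this be distributed.
    return [composition_features(seq) for seq in sequences]


if __name__ == "__main__":
    toy_sequences = ["MKVLA", "GGGAAC", "MKT"]  # hypothetical toy input
    for seq, vec in zip(toy_sequences, extract_all(toy_sequences)):
        print(seq, len(vec), round(sum(vec), 6))
```

Because each vector depends on a single sequence, the per-sequence function could in principle be applied as a map over a distributed collection of sequences; the resulting fixed-length vectors are the kind of input a fuzzy c-means style clustering algorithm expects.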



Acknowledgements

This work is supported by the National Supercomputing Mission, HPC Applications Development Funded Research Project by DST in collaboration with the Ministry of Electronics and Information Technology (MeiTY), Govt. of India.

Author information

Authors and Affiliations

Indian Institute of Technology, Indore, India
Preeti Jha, Aruna Tiwari, Mukkamalla Mounika & Neha Nagendra
Mahindra University, Hyderabad, India
Neha Bharill, Om Prakash Patel & Nilagiri Harshith
ICAR-Indian Institute of Soybean Research, Indore, India
Milind Ratnaparkhe

Corresponding author

Correspondence to Preeti Jha.

Ethics declarations

Conflicts of interest

All authors declare that there are no conflicts of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Jha, P., Tiwari, A., Bharill, N. et al. Apache Spark-based scalable feature extraction approaches for protein sequence and their clustering performance analysis. Int J Data Sci Anal 15, 359–378 (2023). https://doi.org/10.1007/s41060-022-00381-6

Received: 27 September 2022
Accepted: 24 December 2022
Published: 03 January 2023
Issue Date: May 2023
DOI: https://doi.org/10.1007/s41060-022-00381-6
