Abstract
Genome sequencing projects are rapidly increasing the volume of high-dimensional protein sequence datasets, and extracting features from such datasets poses many challenges. Although many feature extraction methods exist, they are not scalable, so applying them to millions of protein sequences becomes impractical. To address this, we propose two Apache Spark-based scalable feature extraction approaches that extract significant features from huge protein sequence datasets based on statistical properties: 60d-SPF (60-dimensional Scalable Protein Feature) and 6d-SCPSF (6-dimensional Scalable Co-occurrence-based Probability-Specific Feature). Both approaches capture the statistical properties of amino acids to create a fixed-length numeric feature vector that represents each protein sequence with 60 and 6 features, respectively. The preprocessed protein sequences are then given as input to four clustering algorithms: scalable random sampling with iterative optimization fuzzy c-means (SRSIO-FCM), scalable literal fuzzy c-means (SLFCM), kernelized SRSIO-FCM (KSRSIO-FCM), and kernelized SLFCM (KSLFCM). We conducted extensive experiments on various soybean protein datasets to compare the proposed 60d-SPF and 6d-SCPSF methods with existing feature extraction methods on these clustering algorithms. The results, reported in terms of the Silhouette index and the Davies–Bouldin index, show that 60d-SPF achieves significantly better results on SRSIO-FCM, SLFCM, KSRSIO-FCM, and KSLFCM than both 6d-SCPSF and the existing feature extraction approaches.
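To make the idea of a fixed-length numeric feature vector concrete, the following is a minimal sketch of a much simpler statistic, plain amino acid composition, which maps a sequence of any length to 20 relative frequencies. It is not the 60d-SPF or 6d-SCPSF method, whose exact features are not given in this abstract. The names `AMINO_ACIDS` and `composition_features` and the two example sequences are assumptions made for illustration only.

```python
from collections import Counter

# The 20 standard amino acids in a fixed order; this ordering is an
# illustrative assumption, not something specified by the paper.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_features(sequence):
    """Map a protein sequence of any length to 20 relative frequencies.

    Hypothetical helper: plain amino acid composition, used here only to
    show how a variable-length sequence becomes a fixed-length vector.
    """
    counts = Counter(sequence.upper())
    # Count only standard residues so that symbols such as 'X' are ignored.
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[a] / total for a in AMINO_ACIDS]


# Made-up toy sequences of different lengths.
sequences = ["MKVLAAGIV", "MKKLLPTAAAGLLLLA"]
vectors = [composition_features(s) for s in sequences]
```

Each sequence is processed independently of the others, so in a Spark setting a function of this kind could be applied with a `map` over a distributed collection of sequences. That independence is the property that makes per-sequence feature extraction amenable to scaling out.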






Acknowledgements
This work is supported by the National Supercomputing Mission, HPC Applications Development Funded Research Project by DST in collaboration with the Ministry of Electronics and Information Technology (MeitY), Govt. of India.
Ethics declarations
Conflicts of interest
All authors declare that there are no conflicts of interest.
About this article
Cite this article
Jha, P., Tiwari, A., Bharill, N. et al. Apache Spark-based scalable feature extraction approaches for protein sequence and their clustering performance analysis. Int J Data Sci Anal 15, 359–378 (2023). https://doi.org/10.1007/s41060-022-00381-6
DOI: https://doi.org/10.1007/s41060-022-00381-6