Unsupervised spectral feature selection algorithms for high dimensional data

  • Research Article
  • Published in Frontiers of Computer Science

Abstract

Detecting informative features for explainable analysis of high dimensional data is a significant and challenging task, especially for data with a very small number of samples. Feature selection, in particular unsupervised feature selection, is the right way to address this challenge. Therefore, two unsupervised spectral feature selection algorithms are proposed in this paper. They group features using an advanced self-tuning spectral clustering algorithm based on local standard deviation, so as to detect globally optimal feature clusters as far as possible. Two feature ranking techniques, cosine-similarity-based feature ranking and entropy-based feature ranking, are then proposed, so that the representative feature of each cluster can be selected to form the feature subset on which the explainable classification system is built. The effectiveness of the proposed algorithms is tested on high dimensional benchmark omics datasets and compared to that of peer methods, and statistical tests are conducted to determine whether the results of the proposed spectral feature selection algorithms differ significantly from those of the peer methods. The extensive experiments demonstrate that the proposed unsupervised spectral feature selection algorithms outperform the peer methods, especially the one based on the cosine-similarity feature ranking technique. The statistical test results show that the spectral feature selection algorithm based on entropy feature ranking performs best. The selected features demonstrate strong discriminative capability in downstream classifiers for omics data, such that an AI system built on them would be reliable and explainable. This is especially significant for building transparent and trustworthy medical diagnostic systems from an interpretable AI perspective.
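
To make the described pipeline concrete, the following sketch (not the authors' code) clusters the features of a data matrix with spectral clustering on a locally scaled, self-tuning-style affinity and then keeps one representative feature per cluster using either a mean-cosine-similarity score or an entropy score. The function name select_features, the use of scikit-learn's SpectralClustering, the k-th-neighbour local scale (standing in for the paper's local-standard-deviation scale), and the direction in which each score is maximised are illustrative assumptions rather than details taken from the paper.

import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import SpectralClustering


def select_features(X, n_clusters=20, ranking="cosine", k_local=7, random_state=0):
    """Pick one representative feature per spectral feature cluster.

    X: (n_samples, n_features) data matrix.
    Returns the sorted indices of the selected features.
    """
    F = X.T                      # treat each feature as a point in sample space
    D = cdist(F, F)              # pairwise Euclidean distances between features
    # Local scale per feature: distance to its k-th nearest neighbour, the usual
    # self-tuning heuristic (the paper uses a local-standard-deviation variant).
    sigma = np.sort(D, axis=1)[:, min(k_local, D.shape[0] - 1)]
    A = np.exp(-(D ** 2) / (np.outer(sigma, sigma) + 1e-12))
    labels = SpectralClustering(n_clusters=n_clusters, affinity="precomputed",
                                assign_labels="kmeans",
                                random_state=random_state).fit_predict(A)

    selected = []
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        if idx.size == 0:
            continue
        feats = F[idx]
        if ranking == "cosine":
            # Representative = feature with the highest mean cosine similarity
            # to the other features in its cluster.
            unit = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-12)
            score = (unit @ unit.T).mean(axis=1)
        else:
            # Shannon entropy of each feature's normalised magnitude profile across
            # samples; taking the maximum-entropy feature is an illustrative choice.
            p = np.abs(feats) / (np.abs(feats).sum(axis=1, keepdims=True) + 1e-12)
            score = -(p * np.log(p + 1e-12)).sum(axis=1)
        selected.append(idx[np.argmax(score)])
    return np.array(sorted(selected))

A downstream classifier, such as the SVMs used in the experiments, would then be trained on the reduced matrix X[:, select_features(X)].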


Acknowledgements

This work was supported in part by the National Natural Science Foundation of China (Grant Nos. 62076159, 12031010, 61673251, and 61771297), the Fundamental Research Funds for the Central Universities (GK202105003), the Natural Science Basic Research Program of Shaanxi Province of China (2022JM-334), and the Innovation Funds of Graduate Programs at Shaanxi Normal University (2015CXS028 and 2016CSY009).

Author information

Corresponding authors

Correspondence to Zhao Huang or Juanying Xie.

Additional information

Mingzhao Wang is a postdoctoral researcher supervised by Professor Juanying Xie at the School of Computer Science, Shaanxi Normal University, China. He received his PhD degree in bioinformatics and his MS degree in computer software and theory from Shaanxi Normal University, China in 2021 and 2017, respectively, and his BS degree in computer science from Shanxi Normal University, China in 2014. His research interests include machine learning and bioinformatics.

Henry Han received his PhD degree from the University of Iowa, USA in 2004. He is currently the McCollum Endowed Chair in Data Science and professor of computer science in the Department of Computer Science of the Rogers College of Engineering and Computer Science at Baylor University, USA. His current research interests include AI, data science, fintech, big data, quantum computing, and cybersecurity. He has published more than 90 articles in leading journals and conferences in these fields. He was previously a professor of computer science at Fordham University, USA, where he was the founding director of the MS program in cybersecurity and the associate chair of the Department of Computer and Information Science.

Zhao Huang received his MSc degree in information science from City University, UK in 2006 and his PhD degree in information systems and computing from Brunel University, UK in 2011. From 2011 to 2013, he was a postdoctoral research fellow at the Telfer School of Management, University of Ottawa, Canada. He is an associate professor at the School of Computer Science, Shaanxi Normal University, China. His research interests include information systems, human-computer interaction, and intelligent recommendation systems.

Juanying Xie is a professor and PhD supervisor at the School of Computer Science, Shaanxi Normal University, China. She received her PhD and MS degrees from Xidian University, China in 2012 and 2004, respectively, and her BS degree from Shaanxi Normal University, China in 1993. Her research interests include machine learning, data mining, and biomedical data analysis. Her research is highly cited, with one article in the top 1% of ESI, one ranked among the top 3 hotspot articles of SCIENTIA SINICA Informationis, and three articles included in F5000. She is an associate editor of Health Information Science and Systems and an editorial board member of the Journal of Shaanxi Normal University (Natural Science Edition).

About this article

Cite this article

Wang, M., Han, H., Huang, Z. et al. Unsupervised spectral feature selection algorithms for high dimensional data. Front. Comput. Sci. 17, 175330 (2023). https://doi.org/10.1007/s11704-022-2135-0
