Abstract
Unsupervised feature selection has attracted increasing attention for its promising performance on high-dimensional data, where dimensionality keeps growing and labeling is increasingly expensive. Existing unsupervised feature selection methods mostly assume that linear relationships can explain all feature associations. However, data with exclusively linear relationships are rare in practice. Moreover, the quality of the similarity matrix significantly affects the effectiveness of conventional spectral-based methods; real-world data contain considerable noise and redundancy, making a similarity matrix built from the raw data unreliable. To address these problems, we propose a novel and robust feature selection method built on a novel nonlinear mapping function, aiming to mine the nonlinear relationships among features. Furthermore, we incorporate manifold learning into the training process, embedded with adaptive graph constraints based on the principle of maximum entropy, to preserve the intrinsic structure of the data while capturing more accurate information. An efficient and effective algorithm is designed to optimize our method. Experiments on eight benchmark datasets from face images, biology, and time series show that our method outperforms nine state-of-the-art algorithms, validating its superiority and effectiveness. The source code is available at https://github.com/aasdlaca/NRASP.
This work was supported in part by the National Natural Science Foundation of China under Grant 62306244, and in part by the Key Project of Shaanxi Provision-City Linkage under Grant 2022GD-TSLD-53.
References
Atashgahi, Z., et al.: Quick and robust feature selection: the strength of energy-efficient sparse training for autoencoders. Mach. Learn. 111(1), 377–414 (2022)
Balın, M.F., Abid, A., Zou, J.: Concrete autoencoders: differentiable feature selection and reconstruction. In: Proceedings of the International Conference on Machine Learning, pp. 444–453 (2019)
Cai, D., Zhang, C., He, X.: Unsupervised feature selection for multi-cluster data. In: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 333–342 (2010)
Gong, X., Yu, L., Wang, J., Zhang, K., Bai, X., Pal, N.R.: Unsupervised feature selection via adaptive autoencoder with redundancy control. Neural Netw. 150, 87–101 (2022)
Gu, Q., Li, Z., Han, J.: Joint feature selection and subspace learning. In: Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pp. 1294–1299 (2011)
Han, K., Wang, Y., Zhang, C., Li, C., Xu, C.: Autoencoder inspired unsupervised feature selection. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2941–2945 (2018)
He, X., Cai, D., Niyogi, P.: Laplacian score for feature selection. In: Advances in Neural Information Processing Systems 18 [Neural Information Processing Systems, NIPS], pp. 507–514 (2005)
Huang, Q., Xia, T., Sun, H., Yamada, M., Chang, Y.: Unsupervised nonlinear feature selection from high-dimensional signed networks. In: The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI), pp. 4182–4189 (2020)
Jaynes, E.T.: Information theory and statistical mechanics. Phys. Rev. 106(4), 620 (1957)
Li, X., Zhang, H., Zhang, R., Liu, Y., Nie, F.: Generalized uncorrelated regression with adaptive graph for unsupervised feature selection. IEEE Trans. Neural Netw. Learn. Syst. 30(5), 1587–1595 (2019)
Li, Z., Yang, Y., Liu, J., Zhou, X., Lu, H.: Unsupervised feature selection using nonnegative spectral analysis. In: Proceedings of the AAAI Conference on Artificial Intelligence (2012)
Liu, D.C., Nocedal, J.: On the limited memory BFGS method for large scale optimization. Math. Program. 45(1), 503–528 (1989)
Mahmud, M., Kaiser, M.S., Hussain, A., Vassanelli, S.: Applications of deep learning and reinforcement learning to biological data. IEEE Trans. Neural Netw. Learn. Syst. 29(6), 2063–2079 (2018)
Nie, F., Zhu, W., Li, X.: Unsupervised feature selection with structured graph optimization. In: Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pp. 1302–1308 (2016)
Qian, M., Zhai, C.: Robust unsupervised feature selection. In: Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pp. 1621–1627 (2013)
Saberian, M.J., Vasconcelos, N.: Boosting algorithms for simultaneous feature extraction and selection. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 2448–2455 (2012)
Yang, Y., Shen, H.T., Ma, Z., Huang, Z., Zhou, X.: \({l}_{{2,1}}\)-norm regularized discriminative feature selection for unsupervised learning. In: Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pp. 1589–1594 (2011)
You, M., Ban, L., Wang, Y., Kang, J., Wang, G., Yuan, A.: Unsupervised feature selection with joint self-expression and spectral analysis via adaptive graph constraints. Multim. Tools Appl. 82(4), 5879–5898 (2023)
You, M., Yuan, A., He, D., Li, X.: Unsupervised feature selection via neural networks and self-expression with adaptive graph constraint. Pattern Recognit. 135, 109173 (2023)
You, M., Yuan, A., Zou, M., He, D.J., Li, X.: Robust unsupervised feature selection via multi-group adaptive graph representation. IEEE Trans. Knowl. Data Eng. (2021)
Yuan, A., Huang, J., Wei, C., Zhang, W., Zhang, N., You, M.: Unsupervised feature selection via feature-grouping and orthogonal constraint. In: International Conference on Pattern Recognition (ICPR), pp. 720–726 (2022)
Yuan, A., You, M., He, D., Li, X.: Convex non-negative matrix factorization with adaptive graph for unsupervised feature selection. IEEE Trans. Cybern. 52(6), 5522–5534 (2022)
Zhang, Y., et al.: Unsupervised nonnegative adaptive feature extraction for data representation. IEEE Trans. Knowl. Data Eng. 31(12), 2423–2440 (2019)
Zhu, P., Zhu, W., Hu, Q., Zhang, C., Zuo, W.: Subspace clustering guided unsupervised feature selection. Pattern Recogn. 66(C), 364–374 (2017)
Appendices
A Derivation
A.1 Derivation of the Manifold Structure Preservation Term
Recalling the aforementioned definition of the manifold structure preservation, if two data points are close in the original data space, the projected points \(W^{T}x_i\) and \(W^{T}x_j\) should also have a small distance. Therefore, we can obtain the manifold structure preservation term as:
$$\begin{aligned} \min _{W} \; \frac{1}{2}\sum _{i,j=1}^{n} s_{ij} \left\| W^{T}x_i - W^{T}x_j \right\| _2^2 \end{aligned}$$
where \(s_{ij}\) denotes the similarity between the data points \(x_i\) and \(x_j\). A large \(s_{ij}\) forces the projected distance \(||W^{T}x_i - W^{T}x_j||_2^2\) to be small, while the distance is allowed to be large only when \(s_{ij}\) is small. Therefore, the neighbor relationships of the original data points are maintained among the mapped data points.
We can verify that
$$\begin{aligned} \frac{1}{2}\sum _{i,j=1}^{n} s_{ij} \left\| W^{T}x_i - W^{T}x_j \right\| _2^2 = Tr({W^T}{X^T}{L_S}XW) \end{aligned}$$
where \(L_S\) is a Laplacian matrix and \(Tr(\cdot )\) denotes the trace of a matrix. \(L_S\) is calculated by \(L_S = D - (\frac{S+S^{T}}{2})\), where D is a diagonal matrix whose elements are defined as:
$$\begin{aligned} d_{ii} = \sum _{j=1}^{n} \frac{s_{ij}+s_{ji}}{2} \end{aligned}$$
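The trace identity above can be checked numerically. The following sketch uses arbitrary random data (sizes n, d, k are illustrative) and stores the projected points as rows of \(Y = XW\), verifying that the pairwise weighted distances equal \(Tr(W^{T}X^{T}L_S XW)\):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 6, 5, 3
X = rng.standard_normal((n, d))
W = rng.standard_normal((d, k))
S = rng.random((n, n))           # possibly asymmetric similarity matrix

# Laplacian L_S = D - (S + S^T)/2, with d_ii = sum_j (s_ij + s_ji)/2
A = (S + S.T) / 2
D = np.diag(A.sum(axis=1))
L = D - A

Y = X @ W                        # projected points W^T x_i, stored as rows
lhs = 0.5 * sum(S[i, j] * np.sum((Y[i] - Y[j]) ** 2)
                for i in range(n) for j in range(n))
rhs = np.trace(Y.T @ L @ Y)
assert np.isclose(lhs, rhs)
```

Because the squared distance is symmetric in i and j, the asymmetric weights \(s_{ij}\) contribute exactly as their symmetrized counterparts, which is why the identity holds for any S.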
A.2 Derivation of the KKT Conditions
With the Lagrangian multiplier method, Eq. (6) (in main body) is rewritten as:
where \(M = [\mu _1, \mu _2, \ldots , \mu _m]\) and \(\Lambda = [\lambda _{ij}]_{m \times m}\) are Lagrangian multipliers. The KKT conditions of Eq. (17) are summarized as
Based on Eq. (18), we can get the optimal solution of \(s_{ij}\) shown in Eq. (7) (in main body).
B Experiment Setup
To validate the effectiveness of our method, we compared it with a baseline using all features and nine representative unsupervised FS methods. These methods are briefly described as follows.
1. All-Fea: Use all features for clustering. This method serves as the baseline to verify whether the selected features can outperform the full feature set in clustering.
2. Laplacian score (LS) [7]: This method measures features using variance and local structure preservation ability.
3. Multi-cluster feature selection (MCFS) [3]: This method uses an \({{l}_{1}}\)-regularized regression model with spectral analysis to select the most important features, preserving the data's clustering structure.
4. Nonnegative discriminative feature selection (NDFS) [11]: This method utilizes the discriminative information of the data by incorporating cluster label learning into FS. In addition, it imposes a nonnegative constraint to obtain more accurate cluster labels.
5. Unsupervised discriminative feature selection (UDFS) [17]: This method integrates discriminative analysis with \({{l}_{2,1}}\)-norm regularization in a unified framework to exploit discriminative information for unsupervised FS.
6. Generalized uncorrelated regression with adaptive graph for unsupervised feature selection (URAFS) [10]: This method selects features and performs manifold learning simultaneously using an uncorrelated regression model, incorporating the data's geometric structure into the manifold learning process.
7. Autoencoder feature selection (AEFS) [6]: This method combines an autoencoder network with group LASSO, exploiting both linear and nonlinear information among features to perform FS.
8. Concrete autoencoder (CAE) [2]: This method proposes a concrete autoencoder for differentiable feature selection and reconstruction. CAE uses a concrete selector layer with an effective learning algorithm that converges to a discrete feature subset.
9. Quick selection (QS) [1]: This method selects features by the strength of neurons in sparse denoising autoencoders trained with a sparse evolution strategy.
It should be noted that the Laplacian score (LS) [7] does not belong to the embedding-based methods. Owing to its efficiency and decent performance, LS remains a popular method for FS; hence, we include it in our comparative study. The All-Fea baseline simply uses all features.
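For reference, a minimal sketch of the Laplacian score is given below. It assumes the usual heat-kernel similarity over a k-nearest-neighbor graph; the kernel width `t` and neighbor count `k` are illustrative choices, not values prescribed here. Lower scores indicate better features.

```python
import numpy as np

def laplacian_score(X, k=5, t=1.0):
    """Laplacian score per feature (He et al., 2005); lower = better."""
    n = X.shape[0]
    # Heat-kernel similarity restricted to k-nearest neighbors
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    S = np.exp(-d2 / t)
    knn = np.argsort(d2, axis=1)[:, 1:k + 1]     # exclude self (column 0)
    mask = np.zeros((n, n), dtype=bool)
    mask[np.repeat(np.arange(n), k), knn.ravel()] = True
    S = np.where(mask | mask.T, S, 0.0)          # symmetrize the kNN graph
    D = np.diag(S.sum(axis=1))
    L = D - S
    d = np.diag(D)
    scores = []
    for r in range(X.shape[1]):
        f = X[:, r]
        f = f - (f @ d) / d.sum()                # remove D-weighted mean
        denom = f @ (d * f)                      # f^T D f
        scores.append((f @ L @ f) / denom if denom > 1e-12 else np.inf)
    return np.array(scores)
```

The score of feature r is \(\tilde{f}_r^T L \tilde{f}_r / \tilde{f}_r^T D \tilde{f}_r\), so features that vary smoothly along the graph (preserving local structure) receive small values.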
To evaluate the performance of the various unsupervised methods, we utilized the K-nearest-neighbor (KNN) algorithm in LS, MCFS, UDFS, and NDFS, setting the number of nearest neighbors to five. In addition, the activation functions of the encoder and decoder in our method are set to tanh. For parameter initialization, we used a grid search over \(\left\{ 10^{-3}, 10^{-2}, \ldots , 10^{3}\right\} \) to find the optimal parameters in UDFS, URAFS, AEFS, and our method. In AEFS, we set the number of neurons in the hidden layer to 256. In CAE, we selected one hidden layer with LeakyReLU(0.2) as the activation function. In QS, we searched over \(\left\{ 0.1, 0.2, 0.3, 0.4, 0.5\right\} \) for the parameter \(\zeta \) and over \(\left\{ 2, 5, 10, 13, 20, 25\right\} \) for the parameter \(\varepsilon \). In particular, for our proposed method, we also searched over \(\left\{ 10^{-3}, 10^{-2}, \ldots , 10^{3}\right\} \) for the parameters \(\alpha \), \(\beta \), and \(\gamma \). We selected \(k\in \left\{ 20, 40, \ldots , 300\right\} \) features, respectively, to conduct the experiments.
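The grid search described above can be enumerated as follows. Here `evaluate` is a hypothetical placeholder for the real pipeline (train the model with a parameter triple, keep the top-k features, cluster, and score), not a function provided in this paper:

```python
from itertools import product

grid = [10.0 ** p for p in range(-3, 4)]     # {1e-3, 1e-2, ..., 1e3}
feature_counts = list(range(20, 301, 20))    # k in {20, 40, ..., 300}

def evaluate(alpha, beta, gamma, k):
    """Hypothetical placeholder: train with (alpha, beta, gamma),
    keep the top-k features, cluster, and return a score such as ACC."""
    raise NotImplementedError

# All (alpha, beta, gamma, k) configurations visited by the grid search.
candidates = list(product(grid, grid, grid, feature_counts))
# best = max(candidates, key=lambda cfg: evaluate(*cfg))
```

With seven values per regularization parameter and fifteen feature counts, the search visits 7 × 7 × 7 × 15 configurations per dataset.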
Considering Eq. (13) (in the main body), since W is regularized to be sparse in rows, \(||w_i||_2\) may approach zero during training. To avoid this, we add a small positive constant \(\epsilon \) to ensure that \(Q_{ii}\) remains differentiable. Subsequently, Q is transformed into \(Q^{'}\), whose i-th diagonal elements are defined as
$$\begin{aligned} Q^{'}_{ii} = \frac{1}{2\sqrt{||w_i||_2^2 + \epsilon }} \end{aligned}$$
Replacing Q with \(Q^{'}\), Eq. (12) (in main body) can be written as follows:
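A small numeric sketch of this smoothing is given below. It assumes the common reweighted form \(Q^{'}_{ii} = 1/(2\sqrt{\Vert w_i \Vert _2^2 + \epsilon })\), which stays finite even for all-zero rows of W:

```python
import numpy as np

def q_prime(W, eps=1e-8):
    # Assumed smoothed l2,1 reweighting: Q'_ii = 1 / (2 * sqrt(||w_i||^2 + eps)).
    # eps keeps each entry finite and differentiable when a row of W is zero.
    row_norms_sq = np.sum(W ** 2, axis=1)
    return np.diag(1.0 / (2.0 * np.sqrt(row_norms_sq + eps)))

# Example: the second row of W is exactly zero, yet Q' stays finite.
W = np.array([[1.0, 2.0],
              [0.0, 0.0],
              [3.0, 0.0]])
Q = q_prime(W)
```

With this diagonal matrix, the gradient of the smoothed \(\Vert W \Vert _{2,1}\) term is proportional to \(Q^{'}W\), so row norms shrink toward zero without the objective becoming non-differentiable.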
C Visualization
The training results of the novel nonlinear self-representation are visualized in Fig. 4, where the input sample image is shown in Fig. 4(a) and the reconstructed image in Fig. 4(b). This shows that the nonlinear self-representation model effectively reconstructs the sample by preserving the intrinsic structure and the linear and nonlinear relationships among the original features. Figure 4(c) presents the importance \(\Vert {{w}_{i}}\Vert \) of each feature i, reshaped to the shape of one MNIST sample, such as that in the top left of Fig. 4(a). The 40 most important features of the MNIST dataset are presented in Fig. 4(d). From Fig. 4(c) and Fig. 4(d), the meaningful features are primarily concentrated in the middle of the image rather than at the edges, which is highly consistent with the practical situation. Meanwhile, the region composed of features with high importance resembles the shape of the digit 3. This may imply that the top, bottom, middle, and right parts are key to distinguishing between different digits.
D Ablation Study
Ablation experiments are commonly performed to understand the function of each part of a model. We studied the role of each part of the proposed model by removing different parts in turn. Each of the combinations shown in Tables 3 and 4 is described as follows:
1. Base linear model: The linear self-representation model with sparse constraints:
$$\begin{aligned} \mathcal {L}=\left\| X-XW\right\| _F^2+\lambda \Vert {W}\Vert _{2,1} \end{aligned}$$ (21)
2. Nonlinear self-representation part (b): The basic nonlinear-mapping model based on self-representation, with sparse constraints added. The loss function is
$$\begin{aligned} \mathop {\min }\limits _{W, \varTheta } \quad \mathcal {L}(W; \varTheta ) = \left\| {X - f\left( {g\left( {XW} \right) } \right) } \right\| _F^2 + \alpha {\left\| W \right\| _{2,1}} \end{aligned}$$ (22)
3. \(b+\beta \): The base nonlinear self-representation model incorporated with manifold learning using a fixed similarity matrix:
$$\begin{aligned} \mathop {\min }\limits _{W, \varTheta } \quad \mathcal {L}(W; \varTheta ) = \left\| {X - f\left( {g\left( {XW} \right) } \right) } \right\| _F^2 + \alpha {\left\| W \right\| _{2,1}} + \beta \, Tr({W^T}{X^T}{L_S}XW) \end{aligned}$$ (23)
4. NRASP (\(b+\beta +\gamma \)): The whole proposed model, given by Eq. (5) (in the main body).
As shown in Tables 3 and 4, we take the linear self-representation model as the basis, against which we compare the nonlinear mapping method built on the self-representation idea. The nonlinear model performs comparatively better on most biological datasets, such as lymphoma, TOX_171, and GLIOMA, corroborating its stronger nonlinear learning ability relative to the linear model. However, the basic nonlinear self-representation model underperformed on the warpPIE10P image dataset, while combining it with manifold learning improved performance on all datasets. This indicates that the nonlinear self-representation model captures nonlinear information well but may omit the structural information of the data. Furthermore, when the similarity matrix is dynamically updated during manifold learning, excellent performance is achieved on all but the TOX_171 and HAR datasets, showing that the structural information in the data is captured better. By comparing the combinations of different parts, NRASP balances them well and achieves superior performance.
E Detailed Results
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Yuan, A., Lin, L., Tian, P., Zhang, Q. (2024). Unsupervised Feature Selection via Nonlinear Representation and Adaptive Structure Preservation. In: Liu, Q., et al. Pattern Recognition and Computer Vision. PRCV 2023. Lecture Notes in Computer Science, vol 14431. Springer, Singapore. https://doi.org/10.1007/978-981-99-8540-1_12
Print ISBN: 978-981-99-8539-5
Online ISBN: 978-981-99-8540-1