Abstract
Unsupervised feature selection has attracted increasing attention for its promising performance on high-dimensional data, where dimensionality keeps growing and labeling is increasingly expensive. Existing unsupervised feature selection methods mostly assume that linear relationships can explain all feature associations. However, data with exclusively linear relationships are rare in practice. Moreover, the quality of the similarity matrix significantly affects the effectiveness of conventional spectral-based methods; real-world data contain considerable noise and redundancy, making a similarity matrix built from the raw data unreliable. To address these problems, we propose a novel and robust feature selection method built on a novel nonlinear mapping function, aiming to mine the nonlinear relationships among features. Furthermore, we incorporate manifold learning into the training process, embedded with adaptive graph constraints based on the principle of maximum entropy, to preserve the intrinsic structure of the data while capturing more accurate information. An efficient and effective algorithm is designed to optimize our method. Experiments on eight benchmark datasets from face images, biology, and time series show that our method outperforms nine state-of-the-art algorithms, validating its superiority and effectiveness. The source code is available at https://github.com/aasdlaca/NRASP.
This work was supported in part by the National Natural Science Foundation of China under Grant 62306244, and in part by the Key Project of Shaanxi Provision-City Linkage under Grant 2022GD-TSLD-53.
References
Atashgahi, Z., et al.: Quick and robust feature selection: the strength of energy-efficient sparse training for autoencoders. Mach. Learn. 111(1), 377–414 (2022)
Balın, M.F., Abid, A., Zou, J.: Concrete autoencoders: differentiable feature selection and reconstruction. In: Proceedings of the International Conference on Machine Learning, pp. 444–453 (2019)
Cai, D., Zhang, C., He, X.: Unsupervised feature selection for multi-cluster data. In: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 333–342 (2010)
Gong, X., Yu, L., Wang, J., Zhang, K., Bai, X., Pal, N.R.: Unsupervised feature selection via adaptive autoencoder with redundancy control. Neural Netw. 150, 87–101 (2022)
Gu, Q., Li, Z., Han, J.: Joint feature selection and subspace learning. In: Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pp. 1294–1299 (2011)
Han, K., Wang, Y., Zhang, C., Li, C., Xu, C.: Autoencoder inspired unsupervised feature selection. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2941–2945 (2018)
He, X., Cai, D., Niyogi, P.: Laplacian score for feature selection. In: Advances in Neural Information Processing Systems 18 [Neural Information Processing Systems, NIPS], pp. 507–514 (2005)
Huang, Q., Xia, T., Sun, H., Yamada, M., Chang, Y.: Unsupervised nonlinear feature selection from high-dimensional signed networks. In: The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI), pp. 4182–4189 (2020)
Jaynes, E.T.: Information theory and statistical mechanics. Phys. Rev. 106(4), 620 (1957)
Li, X., Zhang, H., Zhang, R., Liu, Y., Nie, F.: Generalized uncorrelated regression with adaptive graph for unsupervised feature selection. IEEE Trans. Neural Netw. Learn. Syst. 30(5), 1587–1595 (2019)
Li, Z., Yang, Y., Liu, J., Zhou, X., Lu, H.: Unsupervised feature selection using nonnegative spectral analysis. In: Proceedings of the AAAI Conference on Artificial Intelligence (2012)
Liu, D.C., Nocedal, J.: On the limited memory BFGS method for large scale optimization. Math. Program. 45(1), 503–528 (1989)
Mahmud, M., Kaiser, M.S., Hussain, A., Vassanelli, S.: Applications of deep learning and reinforcement learning to biological data. IEEE Trans. Neural Netw. Learn. Syst. 29(6), 2063–2079 (2018)
Nie, F., Zhu, W., Li, X.: Unsupervised feature selection with structured graph optimization. In: Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pp. 1302–1308 (2016)
Qian, M., Zhai, C.: Robust unsupervised feature selection. In: Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pp. 1621–1627 (2013)
Saberian, M.J., Vasconcelos, N.: Boosting algorithms for simultaneous feature extraction and selection. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 2448–2455 (2012)
Yang, Y., Shen, H.T., Ma, Z., Huang, Z., Zhou, X.: \({l}_{{2,1}}\)-norm regularized discriminative feature selection for unsupervised learning. In: Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pp. 1589–1594 (2011)
You, M., Ban, L., Wang, Y., Kang, J., Wang, G., Yuan, A.: Unsupervised feature selection with joint self-expression and spectral analysis via adaptive graph constraints. Multim. Tools Appl. 82(4), 5879–5898 (2023)
You, M., Yuan, A., He, D., Li, X.: Unsupervised feature selection via neural networks and self-expression with adaptive graph constraint. Pattern Recognit. 135, 109173 (2023)
You, M., Yuan, A., Zou, M., He, D.J., Li, X.: Robust unsupervised feature selection via multi-group adaptive graph representation. IEEE Trans. Knowl. Data Eng. (2021)
Yuan, A., Huang, J., Wei, C., Zhang, W., Zhang, N., You, M.: Unsupervised feature selection via feature-grouping and orthogonal constraint. In: International Conference on Pattern Recognition (ICPR), pp. 720–726 (2022)
Yuan, A., You, M., He, D., Li, X.: Convex non-negative matrix factorization with adaptive graph for unsupervised feature selection. IEEE Trans. Cybern. 52(6), 5522–5534 (2022)
Zhang, Y., et al.: Unsupervised nonnegative adaptive feature extraction for data representation. IEEE Trans. Knowl. Data Eng. 31(12), 2423–2440 (2019)
Zhu, P., Zhu, W., Hu, Q., Zhang, C., Zuo, W.: Subspace clustering guided unsupervised feature selection. Pattern Recogn. 66(C), 364–374 (2017)
Appendices
A Derivation
A.1 Derivation of the Manifold Structure Preservation Term
Recalling the aforementioned definition of the manifold structure preservation, if two data points are close in the original data space, the projected points \(W^{T}x_i\) and \(W^{T}x_j\) should also have a small distance. Therefore, we can obtain the manifold structure preservation term as:
$$\begin{aligned} \min _{W} \; \frac{1}{2}\sum _{i,j=1}^{n} s_{ij} \left\| W^{T}x_i - W^{T}x_j \right\| _2^2 \end{aligned}$$
where \(s_{ij}\) denotes the similarity between the data points \(x_i\) and \(x_j\). A large \(s_{ij}\) forces the projected distance \(||W^{T}x_i - W^{T}x_j||_2^2\) to be small, while the distance is allowed to be large only when \(s_{ij}\) is small. Therefore, the neighbor relationships of the original data points are maintained among the mapped data points.
We can verify that
$$\begin{aligned} \frac{1}{2}\sum _{i,j=1}^{n} s_{ij} \left\| W^{T}x_i - W^{T}x_j \right\| _2^2 = Tr({W^T}{X^T}{L_S}XW) \end{aligned}$$
where \(L_S\) is a Laplacian matrix and \(Tr(\cdot )\) denotes the trace of a matrix. \(L_S\) is calculated by \(L_S = D - (\frac{S+S^{T}}{2})\), where D is a diagonal matrix whose elements are defined as:
$$\begin{aligned} d_{ii} = \sum _{j=1}^{n} \frac{s_{ij}+s_{ji}}{2} \end{aligned}$$
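The trace identity above can be checked numerically. The following sketch uses arbitrary random data (sizes n, d, k are illustrative) and stores the projected points as rows of \(Y = XW\), verifying that the pairwise weighted distances equal \(Tr(W^{T}X^{T}L_S XW)\):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 6, 5, 3
X = rng.standard_normal((n, d))
W = rng.standard_normal((d, k))
S = rng.random((n, n))           # possibly asymmetric similarity matrix

# Laplacian L_S = D - (S + S^T)/2, with d_ii = sum_j (s_ij + s_ji)/2
A = (S + S.T) / 2
D = np.diag(A.sum(axis=1))
L = D - A

Y = X @ W                        # projected points W^T x_i, stored as rows
lhs = 0.5 * sum(S[i, j] * np.sum((Y[i] - Y[j]) ** 2)
                for i in range(n) for j in range(n))
rhs = np.trace(Y.T @ L @ Y)
assert np.isclose(lhs, rhs)
```

Because the squared distance is symmetric in i and j, the asymmetric weights \(s_{ij}\) contribute exactly as their symmetrized counterparts, which is why the identity holds for any S.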
A.2 Derivation of the KKT Conditions
With the Lagrangian multiplier method, Eq. (6) (in main body) is rewritten as:
where \(M = [\mu _1, \mu _2, \ldots , \mu _m]\) and \(\Lambda = [\lambda _{ij}]_{m \times m}\) are Lagrangian multipliers. The KKT conditions of Eq. (17) are summarized as
Based on Eq. (18), we can get the optimal solution of \(s_{ij}\) shown in Eq. (7) (in main body).
B Experiment Setup
To validate the effectiveness of our method, we compared it with a baseline using all features and nine representative unsupervised FS methods. These methods are briefly described as follows.
1. All-Fea: Use all features for clustering. This method serves as the baseline to verify whether the selected features can outperform the full feature set in clustering.
2. Laplacian score (LS) [7]: This method measures features using variance and local structure preservation ability.
3. Multi-cluster feature selection (MCFS) [3]: This method uses an \({{l}_{1}}\)-regularized regression model with spectral analysis to select the most important features, preserving the data's clustering structure.
4. Nonnegative discriminative feature selection (NDFS) [11]: This method utilizes the discriminative information of the data by incorporating cluster label learning into FS. In addition, it imposes a nonnegative constraint to obtain more accurate cluster labels.
5. Unsupervised discriminative feature selection (UDFS) [17]: This method integrates discriminative analysis with \({{l}_{2,1}}\)-norm regularization in a unified framework to exploit discriminative information for unsupervised FS.
6. Generalized uncorrelated regression with adaptive graph for unsupervised feature selection (URAFS) [10]: This method selects features and performs manifold learning simultaneously using an uncorrelated regression model, incorporating the data's geometric structure into the manifold learning process.
7. Autoencoder feature selection (AEFS) [6]: This method combines an autoencoder network with group LASSO, exploiting both linear and nonlinear information among features to perform FS.
8. Concrete autoencoder (CAE) [2]: This method proposes a concrete autoencoder for differentiable feature selection and reconstruction. CAE uses a concrete selector layer with an effective learning algorithm that converges to a discrete feature subset.
9. Quick selection (QS) [1]: This method selects features by the strength of neurons in sparse denoising autoencoders trained with a sparse evolution strategy.
It should be noted that the Laplacian score (LS) [7] does not belong to the embedding-based methods. Owing to its efficiency and decent performance, LS remains a popular method for FS; hence, we include it in our comparative study. The All-Fea baseline simply uses all features.
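For reference, a minimal sketch of the Laplacian score is given below. It assumes the usual heat-kernel similarity over a k-nearest-neighbor graph; the kernel width `t` and neighbor count `k` are illustrative choices, not values prescribed here. Lower scores indicate better features.

```python
import numpy as np

def laplacian_score(X, k=5, t=1.0):
    """Laplacian score per feature (He et al., 2005); lower = better."""
    n = X.shape[0]
    # Heat-kernel similarity restricted to k-nearest neighbors
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    S = np.exp(-d2 / t)
    knn = np.argsort(d2, axis=1)[:, 1:k + 1]     # exclude self (column 0)
    mask = np.zeros((n, n), dtype=bool)
    mask[np.repeat(np.arange(n), k), knn.ravel()] = True
    S = np.where(mask | mask.T, S, 0.0)          # symmetrize the kNN graph
    D = np.diag(S.sum(axis=1))
    L = D - S
    d = np.diag(D)
    scores = []
    for r in range(X.shape[1]):
        f = X[:, r]
        f = f - (f @ d) / d.sum()                # remove D-weighted mean
        denom = f @ (d * f)                      # f^T D f
        scores.append((f @ L @ f) / denom if denom > 1e-12 else np.inf)
    return np.array(scores)
```

The score of feature r is \(\tilde{f}_r^T L \tilde{f}_r / \tilde{f}_r^T D \tilde{f}_r\), so features that vary smoothly along the graph (preserving local structure) receive small values.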
To evaluate the performance of the various unsupervised methods, we utilized the K-nearest-neighbor (KNN) algorithm in LS, MCFS, UDFS, and NDFS, setting the number of nearest neighbors to five. In addition, the activation functions of the encoder and decoder in our method are set to tanh. For parameter initialization, we used a grid search over \(\left\{ 10^{-3}, 10^{-2}, \ldots , 10^{3}\right\} \) to find the optimal parameters in UDFS, URAFS, AEFS, and our method. In AEFS, we set the number of neurons in the hidden layer to 256. In CAE, we selected one hidden layer with LeakyReLU(0.2) as the activation function. In QS, we searched over \(\left\{ 0.1, 0.2, 0.3, 0.4, 0.5\right\} \) for the parameter \(\zeta \) and over \(\left\{ 2, 5, 10, 13, 20, 25\right\} \) for the parameter \(\varepsilon \). In particular, for our proposed method, we also searched over \(\left\{ 10^{-3}, 10^{-2}, \ldots , 10^{3}\right\} \) for the parameters \(\alpha \), \(\beta \), and \(\gamma \). We selected \(k\in \left\{ 20, 40, \ldots , 300\right\} \) features, respectively, to conduct the experiments.
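The grid search described above can be enumerated as follows. Here `evaluate` is a hypothetical placeholder for the real pipeline (train the model with a parameter triple, keep the top-k features, cluster, and score), not a function provided in this paper:

```python
from itertools import product

grid = [10.0 ** p for p in range(-3, 4)]     # {1e-3, 1e-2, ..., 1e3}
feature_counts = list(range(20, 301, 20))    # k in {20, 40, ..., 300}

def evaluate(alpha, beta, gamma, k):
    """Hypothetical placeholder: train with (alpha, beta, gamma),
    keep the top-k features, cluster, and return a score such as ACC."""
    raise NotImplementedError

# All (alpha, beta, gamma, k) configurations visited by the grid search.
candidates = list(product(grid, grid, grid, feature_counts))
# best = max(candidates, key=lambda cfg: evaluate(*cfg))
```

With seven values per regularization parameter and fifteen feature counts, the search visits 7 × 7 × 7 × 15 configurations per dataset.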
Considering Eq. (13) (in the main body), since W is regularized to be sparse in rows, \(||w_i||_2\) may approach zero during training. To avoid this, we add a small positive constant \(\epsilon \) to ensure that \(Q_{ii}\) remains differentiable. Subsequently, Q is transformed into \(Q^{'}\), whose i-th diagonal elements are defined as
$$\begin{aligned} Q^{'}_{ii} = \frac{1}{2\sqrt{||w_i||_2^2 + \epsilon }} \end{aligned}$$
Replacing Q with \(Q^{'}\), Eq. (12) (in main body) can be written as follows:
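A small numeric sketch of this smoothing is given below. It assumes the common reweighted form \(Q^{'}_{ii} = 1/(2\sqrt{\Vert w_i \Vert _2^2 + \epsilon })\), which stays finite even for all-zero rows of W:

```python
import numpy as np

def q_prime(W, eps=1e-8):
    # Assumed smoothed l2,1 reweighting: Q'_ii = 1 / (2 * sqrt(||w_i||^2 + eps)).
    # eps keeps each entry finite and differentiable when a row of W is zero.
    row_norms_sq = np.sum(W ** 2, axis=1)
    return np.diag(1.0 / (2.0 * np.sqrt(row_norms_sq + eps)))

# Example: the second row of W is exactly zero, yet Q' stays finite.
W = np.array([[1.0, 2.0],
              [0.0, 0.0],
              [3.0, 0.0]])
Q = q_prime(W)
```

With this diagonal matrix, the gradient of the smoothed \(\Vert W \Vert _{2,1}\) term is proportional to \(Q^{'}W\), so row norms shrink toward zero without the objective becoming non-differentiable.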
C Visualization
The training results of the novel nonlinear self-representation are visualized in Fig. 4, where the input sample image is shown in Fig. 4(a) and the reconstructed image in Fig. 4(b). This shows that the nonlinear self-representation model effectively reconstructs the sample by preserving the intrinsic structure and the linear and nonlinear relationships among the original features. Figure 4(c) presents the importance \(\Vert {{w}_{i}}\Vert \) of each feature i, reshaped to the shape of one MNIST sample, such as that in the top left of Fig. 4(a). The 40 most important features of the MNIST dataset are presented in Fig. 4(d). From Fig. 4(c) and Fig. 4(d), the meaningful features are primarily concentrated in the middle of the image rather than at the edges, which is highly consistent with the practical situation. Meanwhile, the region composed of features with high importance resembles the shape of the digit 3. This may imply that the top, bottom, middle, and right parts are key to distinguishing between different digits.
D Ablation Study
Ablation experiments are commonly performed to understand the function of each part of a model. We studied the role of each part of the proposed model by removing different parts in turn. Each of the combinations shown in Tables 3 and 4 is described as follows:
1. Base linear model: The linear self-representation model with sparse constraints:
$$\begin{aligned} \mathcal {L}=\left\| X-XW\right\| _F^2+\lambda \Vert {W}\Vert _{2,1} \end{aligned}$$ (21)
2. Nonlinear self-representation part (b): The basic nonlinear-mapping model based on self-representation, with sparse constraints added. The loss function is
$$\begin{aligned} \mathop {\min }\limits _{W, \varTheta } \quad \mathcal {L}(W; \varTheta ) = \left\| {X - f\left( {g\left( {XW} \right) } \right) } \right\| _F^2 + \alpha {\left\| W \right\| _{2,1}} \end{aligned}$$ (22)
3. \(b+\beta \): The base nonlinear self-representation model incorporated with manifold learning using a fixed similarity matrix:
$$\begin{aligned} \mathop {\min }\limits _{W, \varTheta } \quad \mathcal {L}(W; \varTheta ) = \left\| {X - f\left( {g\left( {XW} \right) } \right) } \right\| _F^2 + \alpha {\left\| W \right\| _{2,1}} + \beta \, Tr({W^T}{X^T}{L_S}XW) \end{aligned}$$ (23)
4. NRASP (\(b+\beta +\gamma \)): The whole proposed model, given by Eq. (5) (in the main body).
As shown in Tables 3 and 4, we take the linear self-representation model as the basis, against which we compare the nonlinear mapping method built on the self-representation idea. The nonlinear model performs comparatively better on most biological datasets, such as lymphoma, TOX_171, and GLIOMA, corroborating its stronger nonlinear learning ability relative to the linear model. However, the basic nonlinear self-representation model underperformed on the warpPIE10P image dataset, while combining it with manifold learning improved performance on all datasets. This indicates that the nonlinear self-representation model captures nonlinear information well but may omit the structural information of the data. Furthermore, when the similarity matrix is dynamically updated during manifold learning, excellent performance is achieved on all but the TOX_171 and HAR datasets, showing that the structural information in the data is captured better. By comparing the combinations of different parts, NRASP balances them well and achieves superior performance.
E Detailed Results
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Yuan, A., Lin, L., Tian, P., Zhang, Q. (2024). Unsupervised Feature Selection via Nonlinear Representation and Adaptive Structure Preservation. In: Liu, Q., et al. Pattern Recognition and Computer Vision. PRCV 2023. Lecture Notes in Computer Science, vol 14431. Springer, Singapore. https://doi.org/10.1007/978-981-99-8540-1_12
Print ISBN: 978-981-99-8539-5
Online ISBN: 978-981-99-8540-1