Unsupervised feature selection with robust data reconstruction (UFS-RDR) and outlier detection

https://doi.org/10.1016/j.eswa.2022.117008

Highlights

  • Feature selection methods in unsupervised learning are sensitive to outliers.

  • A novel unsupervised feature selection with robust data reconstruction is proposed.

  • It minimizes the graph regularized weighted data reconstruction error function.

  • This function downweights observations with large Mahalanobis distances.

  • The proposed method outperforms competing methods in the presence of outliers.

Abstract

In unsupervised learning, traditional feature selection methods are not always efficient, and their performance can be severely degraded in the presence of outliers and noise. To address this issue, we propose a novel robust unsupervised feature selection method, called Unsupervised Feature Selection with Robust Data Reconstruction (UFS-RDR), which minimizes a graph-regularized weighted data reconstruction error function. Outliers are detected using the well-known Mahalanobis distance, and a Huber-type weight function is then constructed from these distances; this weight function downweights clustering observations that have large distances. Our experimental results on both synthetic and real-world datasets indicate that the proposed UFS-RDR approach achieves good feature selection performance and outperforms competing non-robust unsupervised feature selection methods in the presence of contamination in the unlabeled data.

Introduction

With the recent technological revolution, large amounts of data of growing scale, diversity and complexity have been produced in many practical applications and analyzed in various contexts such as machine learning, pattern recognition, computer vision and data mining (Fan et al., 2014, Jiang et al., 2014, Jiang et al., 2016, Tang, Cao et al., 2018, Yuan et al., 2016). Huge sample sizes and high-dimensional feature spaces introduce specific computational and statistical challenges, including the time complexity and memory load of algorithms, and they degrade the performance of methods because of the curse of dimensionality and the presence of irrelevant, redundant and noisy features (Elhamifar and Vidal, 2013, Wahid et al., 2020). To handle these challenges, feature selection is a preliminary step for high-dimensional data processing in machine learning and data mining. The objective of feature selection is to identify a set of discriminative and salient features while maintaining the underlying structure of the data and achieving better generalization (Mitra, Murthy, & Pal, 2002).

In recent years, a range of feature selection techniques has been developed (Cong et al., 2015, Shang et al., 2019, Xiang et al., 2015). Based on the availability of label information, feature selection techniques can be classified as supervised or unsupervised. In supervised learning, the class labels are specified and it is easy to define what a relevant feature means. For instance, a feature is said to be relevant to a class if it is highly correlated with the class label. The work in Hall (2000) uses a correlation-based heuristic approach to assess the significance of features; the algorithm is simple, fast to execute and applicable to discrete and continuous class data sets. Another approach, which selects the optimal feature subset based on maximum weight and minimum redundancy (MWMR), was introduced in Wang, Wu, Kong, Li, and Zhang (2013). In this method, each feature is weighted in such a way as to reflect its importance, where redundancy represents the correlations among the features. Moreover, Peng, Long, and Ding (2005) proposed another mutual-information-based feature selection method called the minimal-redundancy-maximal-relevance criterion (mRMR).

On the other hand, in unsupervised learning we observe only the features, and the class labels are not available. Data clustering, one of the most widely used unsupervised learning methods, groups samples into different clusters. Since research on unsupervised feature selection is relatively recent and the problem is difficult because of the absence of label information, this article concentrates on unsupervised feature selection.

Furthermore, unsupervised feature selection approaches are categorized into different types: filter methods (He et al., 2006, Mitra et al., 2002, Zhao and Liu, 2007), wrapper methods (Tabakhi, Moradi, & Akhlaghian, 2014) and embedded methods (Hou et al., 2013, Hu et al., 2017, Li et al., 2012, Wang et al., 2015, Zhu et al., 2015). Filter models are independent of clustering algorithms, and they are computationally simple and fast. The wrapper model requires one predetermined clustering algorithm and uses its performance as the evaluation criterion. The feature subset selected by wrapper methods is usually more discriminative than that of filter methods, but wrapper methods have high computational cost (Apiletti, Baralis, Bruno, & Fiori, 2012). Embedded feature selection approaches are more efficient than the other two in many aspects and have recently attracted more attention. Algorithms based on the embedded model show that preserving the global pairwise similarity of data points and the local geometric structure of the data is of great importance for feature selection. Preserving the locality structure of data points is clearly more important than preserving global pairwise data similarity for unsupervised feature selection (Fang et al., 2014). The most widely employed method for preserving local geometric structure is graph Laplacian regularization (Zhao et al., 2015); a minimal construction is sketched below.
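For reference, the following is a minimal R sketch of the standard graph Laplacian construction: a k-nearest-neighbour heat-kernel affinity matrix W and the unnormalized Laplacian L = D - W. The choices k = 5 and sigma = 1 are illustrative defaults, not settings taken from the paper.

    # Minimal sketch of the graph Laplacian regularizer's main ingredient:
    # a k-nearest-neighbour heat-kernel affinity W and the unnormalized
    # Laplacian L = D - W (k and sigma are illustrative choices)
    graph_laplacian <- function(X, k = 5, sigma = 1) {
      m  <- nrow(X)                      # observations in rows
      D2 <- as.matrix(dist(X))^2         # pairwise squared Euclidean distances
      W  <- matrix(0, m, m)
      for (j in 1:m) {
        nn <- order(D2[j, ])[2:(k + 1)]  # k nearest neighbours, skipping self
        W[j, nn] <- exp(-D2[j, nn] / (2 * sigma^2))
      }
      W <- pmax(W, t(W))                 # symmetrize the affinity matrix
      diag(rowSums(W)) - W               # L = D - W
    }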

Recently, many unsupervised feature selection algorithms have been proposed that preserve the local structure of the original feature space (He, Zhu, Cheng, Hu, & Zhang, 2017) or optimize clustering performance (Zhu et al., 2015). The former approaches are based on the concept of locality and similarity preservation, which means that if two data points xi and xj are close in the intrinsic geometry of the data distribution, they should also remain close to each other on the chosen set of features. The clustering performance optimization methods, in contrast, identify salient features that optimize specific objective functions. For instance, Zhao et al. (2015) propose a graph regularized feature selection method with data reconstruction, called GRFS. They use the concept of data reconstruction via linear combinations, minimizing the reconstruction error to choose informative features from the original data, and they show that this reconstruction error is an essential criterion for quantifying the importance of the selected features; a schematic form is given below. Further feature selection learning methods can be found in Cheng et al., 2017, Li et al., 2020 and Zhang, Yang, Deng, Cheng, and Li (2017).
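In schematic form (our notation for this sketch, writing the data with observations in rows; the exact GRFS objective in Zhao et al. (2015) also carries a graph regularization term), the reconstruction criterion seeks a feature subset S whose linear combinations best recover the full data:

    \min_{S,\,C} \; \left\| X - X_S C \right\|_F^2

where X_S keeps only the columns (features) indexed by S and C collects the reconstruction coefficients; a small error means the selected features carry enough information to rebuild the discarded ones.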

Although the aforementioned feature selection methods work well on clean data, none of them can resist the adverse effect of outliers and noise in the data. The GRFS method is based on the Frobenius norm, which is highly sensitive to outliers and influential observations, resulting in unsatisfactory feature selection performance. To address this issue, in this article we propose a novel unsupervised feature selection approach via a weighted reconstruction error and graph regularization, which achieves robust feature selection while preserving the local geometrical structure of the data. Specifically, we first detect multivariate outliers using a distance-based approach and construct a weight function that downweights cases with large distances. Second, we integrate the weighted reconstruction error function and graph regularization into a joint framework that deals with outliers and noise in unsupervised feature selection. Clustering performance with the selected features on synthetic data and six benchmark data sets demonstrates the advantages of our algorithm over competing methods.

The major contributions of this work are summarized as follows:

  • 1.

    We first calculate the Mahalanobis distance for every data point xj (for j = 1, 2, …, m). Then a Huber-type weight function is constructed that depends on the Mahalanobis distance and a threshold parameter. As a function of the Mahalanobis distance, the weights are close to zero for data points with large distances, and the threshold parameter determines the proportion of observations being downweighted (see the sketch after this list).

  • 2.

    The proposed UFS-RDR algorithm unifies the above idea with data reconstruction (Zhao et al., 2015) and builds a weighted data reconstruction error that is robust against outliers and noise in unlabeled data. UFS-RDR uses graph regularization and a sparsity-inducing l1-norm on the diagonal matrix Λ to preserve the local structure of the original data and to select features, respectively.
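To make contribution 1 concrete, here is a minimal R sketch of the weighting step. The min(1, c0/d) weight form and the chi-squared quantile threshold are assumptions made for illustration; the paper's exact weight function and threshold rule may differ.

    # Sketch: Mahalanobis-distance-based Huber-type weights (assumed form)
    huber_weights <- function(X, prob = 0.95) {
      # Mahalanobis distance of each row of X from the sample mean
      d <- sqrt(mahalanobis(X, colMeans(X), cov(X)))
      # threshold from a chi-squared quantile; prob controls the share of
      # observations that get downweighted (illustrative choice)
      c0 <- sqrt(qchisq(prob, df = ncol(X)))
      # weight 1 inside the threshold, decaying toward 0 as d grows
      pmin(1, c0 / d)
    }

    # usage: weights for the numeric columns of iris
    w <- huber_weights(as.matrix(iris[, 1:4]))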

The rest of this article is organized as follows. Section 2 reviews related unsupervised feature selection methods. The proposed model is presented in Section 3. Section 4 provides the optimization algorithm for solving the proposed method. In Section 5 we present experimental results on both synthetic and real-world data sets, while Section 6 contains the conclusions.


Related work

During the last two decades, several unsupervised feature selection algorithms have been proposed in the literature. Unsupervised feature selection is a much more complicated problem due to the unavailability of class labels that would guide the search for informative features. Traditional methods such as maximum variance and the Laplacian score (He et al., 2006) are based on feature ranking: features are sorted by their scores and the top features are selected. However, such methods ignore the potential interactions among features in

Notations

We first introduce some notation used throughout this article. Let X ∈ R^(n×m) denote a data matrix, where m and n represent the number of clustering observations and features, respectively, and F = X^T is the feature matrix. We use the notation f_i^T ∈ R^m, for i = 1, 2, …, n, to denote the row vectors of the data matrix X, each of which corresponds to a feature, and we let f_i = (x_{i1}, x_{i2}, …, x_{im})^T denote the projection of every data point onto the ith feature. Throughout the paper, C = (c_1, c_2, …, c_n) ∈ R^(n×n) and A

Optimization and algorithms

In this section, an iterative optimization algorithm for solving problem (9) is presented. We use the optimization procedure described in Lee, Battle, Raina, and Ng (2007) and optimize criterion (9) in two steps: first, learning the feature selection matrix Λ while holding the reconstruction coefficient matrix C fixed; second, learning C while holding Λ fixed. The skeleton below illustrates this alternating scheme.
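A minimal sketch of the control flow only: the concrete updates for criterion (9) are derived in the paper and are not reproduced here, so update_Lambda and update_C are hypothetical stand-ins supplied by the caller.

    # Generic alternating-minimization skeleton for the two-step scheme;
    # update_Lambda() and update_C() are caller-supplied stand-ins for the
    # paper's derived updates, not the actual formulas
    alternate_optimize <- function(Lambda, C, update_Lambda, update_C,
                                   objective, n_iter = 100, tol = 1e-6) {
      obj <- objective(Lambda, C)
      for (it in seq_len(n_iter)) {
        Lambda  <- update_Lambda(C)           # step 1: hold C fixed
        C       <- update_C(Lambda)           # step 2: hold Lambda fixed
        obj_new <- objective(Lambda, C)
        if (abs(obj - obj_new) < tol) break   # stop once the objective stabilizes
        obj <- obj_new
      }
      list(Lambda = Lambda, C = C, objective = obj_new)
    }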

Experiments and analysis

In this section, four competing methods are compared: Baseline, sparcl (Witten & Tibshirani, 2010), GRFS (Zhao et al., 2015) and the proposed UFS-RDR algorithm. The Baseline approach uses all features in each experiment. First, a feature subset is selected by each method, and then the K-means clustering algorithm is employed to obtain the clustering outcomes; a sketch of this protocol is given below. All experiments are implemented in R 3.4.0 on a personal computer.
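A minimal R sketch of this select-then-cluster protocol follows. The helper evaluate_subset() is hypothetical, and purity is used as an illustrative score; the paper's exact evaluation metrics may differ.

    # Sketch of the protocol: cluster on a selected feature subset and
    # score the partition against known labels (purity, illustrative only)
    evaluate_subset <- function(X, selected, labels, k) {
      km  <- kmeans(X[, selected, drop = FALSE], centers = k, nstart = 25)
      tab <- table(km$cluster, labels)            # cluster-vs-label table
      sum(apply(tab, 1, max)) / length(labels)    # purity in [0, 1]
    }

    # usage: k-means on two selected features of iris
    evaluate_subset(as.matrix(iris[, 1:4]), c(3, 4), iris$Species, k = 3)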

Conclusions

In this article, we propose an unsupervised feature selection method, namely UFS-RDR, to deal with outliers and noisy data in cluster analysis. UFS-RDR unifies the ideas of a weighted data reconstruction error and graph regularization to robustly preserve the discriminant information and local similarity of the data. Our experiments on synthetic data showed that the advantage of the proposed model in feature selection and clustering accuracy grows with an increasing proportion of outliers. The

CRediT authorship contribution statement

Abdul Wahid: Conceptualization, Methodology, Data curation, Software, Writing – original draft. Dost Muhammad Khan: Data curation, Writing – original draft, Supervision, Writing – reviewing and editing. Ijaz Hussain: Visualization, Investigation. Sajjad Ahmad Khan: Visualization, Validation. Zardad Khan: Software, Validation, Reviewing and editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The authors would like to thank the editor-in-chief, associate editor and two anonymous reviewers for their insightful and helpful comments and suggestions, which resulted in substantial improvements to this work.

References (57)

  • Tabakhi, S., et al. (2014). An unsupervised feature selection algorithm based on ant colony optimization. Engineering Applications of Artificial Intelligence.

  • Tang, C., et al. (2018). Robust unsupervised feature selection via dual self-representation and manifold regularization. Knowledge-Based Systems.

  • Tang, C., et al. (2018). Robust graph regularized unsupervised feature selection. Expert Systems with Applications.

  • Wahid, A., et al. (2020). Feature selection and classification for gene expression data using novel correlation based overlapping score method via Chou’s 5-steps rule. Chemometrics and Intelligent Laboratory Systems.

  • Wang, S., et al. (2017). Unsupervised feature selection via low-rank approximation and structure learning. Knowledge-Based Systems.

  • Wang, J., et al. (2013). Maximum weight and minimum redundancy: a novel framework for feature subset selection. Pattern Recognition.

  • Xiang, S., et al. (2015). Efficient nonconvex sparse group feature selection via continuous and discrete optimization. Artificial Intelligence.

  • Zhang, Z., et al. (2014). Similarity preserving low-rank representation for enhanced data representation and effective subspace learning. Neural Networks.

  • Zhou, N., et al. (2016). Global and local structure preserving sparse subspace learning: An iterative approach to unsupervised feature selection. Pattern Recognition.

  • Zhu, P., et al. (2015). Unsupervised feature selection by regularized self-representation. Pattern Recognition.

  • Apiletti, D., et al. (2012). Maskedpainter: feature selection for microarray data analysis. Intelligent Data Analysis.

  • Belkin, M., et al. Laplacian eigenmaps and spectral techniques for embedding and clustering.

  • Belkin, M., et al. (2006). Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research.

  • Cai, D., Zhang, C., & He, X. (2010). Unsupervised feature selection for multi-cluster data. In Proceedings...

  • Cheng, D., et al. (2017). Feature selection by combining subspace learning with sparse representation. Multimedia Systems.

  • Elhamifar, E., & Vidal, R. (2013). Sparse subspace clustering: Algorithm, theory, and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence.

  • Fan, J., et al. (2014). Challenges of big data analysis. National Science Review.

  • Hall, M. A. (2000). Correlation-based feature selection of discrete and numeric class machine learning.