Unsupervised feature selection with robust data reconstruction (UFS-RDR) and outlier detection

https://doi.org/10.1016/j.eswa.2022.117008

Highlights

  • Feature selection methods in unsupervised learning are sensitive to outliers.

  • A novel unsupervised feature selection with robust data reconstruction is proposed.

  • It minimizes the graph regularized weighted data reconstruction error function.

  • This function downweights observations with large Mahalanobis distances.

  • The proposed method outperforms competing methods in the presence of outliers.

Abstract

In unsupervised learning, traditional feature selection methods are not always efficient, and their performance can be severely degraded in the presence of outliers and noise. To address this issue, we propose a novel robust unsupervised feature selection method, called Unsupervised Feature Selection with Robust Data Reconstruction (UFS-RDR), which minimizes a graph-regularized weighted data reconstruction error function. Outliers are detected using the well-known Mahalanobis distance, and a Huber-type weight function is then constructed from these distances; this weight function downweights clustering observations that have large distances. Our experimental results on both synthetic and real-world datasets indicate that the proposed UFS-RDR approach achieves good feature selection performance and outperforms competing non-robust unsupervised feature selection methods in the presence of contamination in the unlabeled data.

Introduction

With the recent technological revolution, large amounts of data of growing scale, diversity and complexity have been produced in many practical applications and analyzed in various contexts such as machine learning, pattern recognition, computer vision and data mining (Fan et al., 2014, Jiang et al., 2014, Jiang et al., 2016, Tang, Cao et al., 2018, Yuan et al., 2016). Huge sample sizes and high-dimensional feature spaces introduce specific computational and statistical challenges, including the time complexity and memory load of algorithms, and they degrade the performance of methods because of the curse of dimensionality and the presence of irrelevant, redundant and noisy features (Elhamifar and Vidal, 2013, Wahid et al., 2020). To handle these challenges, feature selection is a preliminary step for high-dimensional data processing in machine learning and data mining. The objective of feature selection is to identify a set of discriminative and salient features while maintaining the underlying structure of the data and achieving better generalization (Mitra, Murthy, & Pal, 2002).

In recent years, a range of feature selection techniques has been developed (Cong et al., 2015, Shang et al., 2019, Xiang et al., 2015). Based on the availability of label information, feature selection techniques can be classified as supervised or unsupervised. In supervised learning, the class labels are specified and it is easy to define what a relevant feature means. For instance, a feature is said to be relevant to a class if it is highly correlated with the class label. The work in Hall (2000) uses a correlation-based heuristic approach to assess the significance of features; the algorithm is simple, fast to execute and applicable to discrete and continuous class data sets. Another approach, which selects the optimal feature subset based on maximum weight and minimum redundancy (MWMR), was introduced in Wang, Wu, Kong, Li, and Zhang (2013). In this method, each feature is weighted in such a way as to reflect its importance, where redundancy represents the correlations among the features. Moreover, Peng, Long, and Ding (2005) proposed another mutual-information-based feature selection method called the minimal-redundancy-maximal-relevance criterion (mRMR).

On the other hand, in unsupervised learning we observe only the features, and the class labels are not available. Data clustering, one of the most widely used unsupervised learning methods, groups samples into different clusters. Since research on unsupervised feature selection is relatively recent and the problem is difficult because of the absence of label information, this article concentrates on unsupervised feature selection.

Furthermore, unsupervised feature selection approaches are categorized into different types: filter methods (He et al., 2006, Mitra et al., 2002, Zhao and Liu, 2007), wrapper methods (Tabakhi, Moradi, & Akhlaghian, 2014) and embedded methods (Hou et al., 2013, Hu et al., 2017, Li et al., 2012, Wang et al., 2015, Zhu et al., 2015). Filter models are independent of clustering algorithms, and they are computationally simple and fast. The wrapper model requires one predetermined clustering algorithm and uses its performance as the evaluation criterion. The feature subset selected by wrapper methods is usually more discriminative than that of filter methods, but wrapper methods have high computational cost (Apiletti, Baralis, Bruno, & Fiori, 2012). Embedded feature selection approaches are more efficient than the other two in many aspects and have recently attracted more attention. Algorithms based on the embedded model show that preserving the global pairwise similarity of data points and the local geometric structure of the data is of great importance for feature selection. Preserving the locality structure of data points is clearly more important than preserving global pairwise data similarity for unsupervised feature selection (Fang et al., 2014). The most widely employed method for preserving local geometric structure is graph Laplacian regularization (Zhao et al., 2015); a minimal construction is sketched below.
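For reference, the following is a minimal R sketch of the standard graph Laplacian construction: a k-nearest-neighbour heat-kernel affinity matrix W and the unnormalized Laplacian L = D - W. The choices k = 5 and sigma = 1 are illustrative defaults, not settings taken from the paper.

    # Minimal sketch of the graph Laplacian regularizer's main ingredient:
    # a k-nearest-neighbour heat-kernel affinity W and the unnormalized
    # Laplacian L = D - W (k and sigma are illustrative choices)
    graph_laplacian <- function(X, k = 5, sigma = 1) {
      m  <- nrow(X)                      # observations in rows
      D2 <- as.matrix(dist(X))^2         # pairwise squared Euclidean distances
      W  <- matrix(0, m, m)
      for (j in 1:m) {
        nn <- order(D2[j, ])[2:(k + 1)]  # k nearest neighbours, skipping self
        W[j, nn] <- exp(-D2[j, nn] / (2 * sigma^2))
      }
      W <- pmax(W, t(W))                 # symmetrize the affinity matrix
      diag(rowSums(W)) - W               # L = D - W
    }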

Recently, many unsupervised feature selection algorithms have been proposed that preserve the local structure of the original feature space (He, Zhu, Cheng, Hu, & Zhang, 2017) or optimize clustering performance (Zhu et al., 2015). The former approaches are based on the concept of locality and similarity preservation, which means that if two data points xi and xj are close in the intrinsic geometry of the data distribution, they should also remain close to each other on the chosen set of features. The clustering performance optimization methods, in contrast, identify salient features that optimize specific objective functions. For instance, Zhao et al. (2015) propose a graph regularized feature selection method with data reconstruction, called GRFS. They use the concept of data reconstruction via linear combinations, minimizing the reconstruction error to choose informative features from the original data, and they show that this reconstruction error is an essential criterion for quantifying the importance of the selected features; a schematic form is given below. Further feature selection learning methods can be found in Cheng et al., 2017, Li et al., 2020 and Zhang, Yang, Deng, Cheng, and Li (2017).
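In schematic form (our notation for this sketch, writing the data with observations in rows; the exact GRFS objective in Zhao et al. (2015) also carries a graph regularization term), the reconstruction criterion seeks a feature subset S whose linear combinations best recover the full data:

    \min_{S,\,C} \; \left\| X - X_S C \right\|_F^2

where X_S keeps only the columns (features) indexed by S and C collects the reconstruction coefficients; a small error means the selected features carry enough information to rebuild the discarded ones.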

Although the aforementioned feature selection methods work well on clean data, none of them can resist the adverse effect of outliers and noise in the data. The GRFS method is based on the Frobenius norm, which is highly sensitive to outliers and influential observations, resulting in unsatisfactory feature selection performance. To address this issue, in this article we propose a novel unsupervised feature selection approach via a weighted reconstruction error and graph regularization, which achieves robust feature selection while preserving the local geometrical structure of the data. Specifically, we first detect multivariate outliers using a distance-based approach and construct a weight function that downweights cases with large distances. Second, we integrate the weighted reconstruction error function and graph regularization into a joint framework that deals with outliers and noise in unsupervised feature selection. Clustering performance with the selected features on synthetic data and six benchmark data sets demonstrates the advantages of our algorithm over competing methods.

The major contributions of this work are summarized as follows:

  • 1.

    We first calculate the Mahalanobis distance for every data point xj (for j = 1, 2, …, m). Then a Huber-type weight function is constructed that depends on the Mahalanobis distance and a threshold parameter. As a function of the Mahalanobis distance, the weights are close to zero for data points with large distances, and the threshold parameter determines the proportion of observations being downweighted (see the sketch after this list).

  • 2.

    The proposed UFS-RDR algorithm unifies the above idea with data reconstruction (Zhao et al., 2015) and builds a weighted data reconstruction error that is robust against outliers and noise in unlabeled data. UFS-RDR uses graph regularization and a sparsity-inducing l1-norm on the diagonal matrix Λ to preserve the local structure of the original data and to select features, respectively.
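To make contribution 1 concrete, here is a minimal R sketch of the weighting step. The min(1, c0/d) weight form and the chi-squared quantile threshold are assumptions made for illustration; the paper's exact weight function and threshold rule may differ.

    # Sketch: Mahalanobis-distance-based Huber-type weights (assumed form)
    huber_weights <- function(X, prob = 0.95) {
      # Mahalanobis distance of each row of X from the sample mean
      d <- sqrt(mahalanobis(X, colMeans(X), cov(X)))
      # threshold from a chi-squared quantile; prob controls the share of
      # observations that get downweighted (illustrative choice)
      c0 <- sqrt(qchisq(prob, df = ncol(X)))
      # weight 1 inside the threshold, decaying toward 0 as d grows
      pmin(1, c0 / d)
    }

    # usage: weights for the numeric columns of iris
    w <- huber_weights(as.matrix(iris[, 1:4]))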

The rest of this article is organized as follows. Section 2 reviews related unsupervised feature selection methods. The proposed model is presented in Section 3. Section 4 provides the optimization algorithm for solving the proposed method. In Section 5 we present experimental results on both synthetic and real-world data sets, while Section 6 contains the conclusions.


Related work

During the last two decades, several unsupervised feature selection algorithms have been proposed in the literature. Unsupervised feature selection is a much more complicated problem due to the unavailability of class labels that would guide the search for informative features. Traditional methods such as maximum variance and the Laplacian score (He et al., 2006) are based on feature ranking: features are sorted by their scores and the top features are selected. However, such methods ignore the potential interactions among features in

Notations

We first introduce some notation used throughout this article. Let X ∈ R^(n×m) denote a data matrix, where m and n represent the number of clustering observations and features, respectively, and F = X^T is the feature matrix. We use the notation f_i^T ∈ R^m, for i = 1, 2, …, n, to denote the row vectors of the data matrix X, each of which corresponds to a feature, and we let f_i = (x_{i1}, x_{i2}, …, x_{im})^T denote the projection of every data point onto the ith feature. Throughout the paper, C = (c_1, c_2, …, c_n) ∈ R^(n×n) and A

Optimization and algorithms

In this section, an iterative optimization algorithm for solving problem (9) is presented. We use the optimization procedure described in Lee, Battle, Raina, and Ng (2007) and optimize criterion (9) in two steps: first, learning the feature selection matrix Λ while holding the reconstruction coefficient matrix C fixed; second, learning C while holding Λ fixed. The skeleton below illustrates this alternating scheme.
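A minimal sketch of the control flow only: the concrete updates for criterion (9) are derived in the paper and are not reproduced here, so update_Lambda and update_C are hypothetical stand-ins supplied by the caller.

    # Generic alternating-minimization skeleton for the two-step scheme;
    # update_Lambda() and update_C() are caller-supplied stand-ins for the
    # paper's derived updates, not the actual formulas
    alternate_optimize <- function(Lambda, C, update_Lambda, update_C,
                                   objective, n_iter = 100, tol = 1e-6) {
      obj <- objective(Lambda, C)
      for (it in seq_len(n_iter)) {
        Lambda  <- update_Lambda(C)           # step 1: hold C fixed
        C       <- update_C(Lambda)           # step 2: hold Lambda fixed
        obj_new <- objective(Lambda, C)
        if (abs(obj - obj_new) < tol) break   # stop once the objective stabilizes
        obj <- obj_new
      }
      list(Lambda = Lambda, C = C, objective = obj_new)
    }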

Experiments and analysis

In this section, four competing methods are compared: Baseline, sparcl (Witten & Tibshirani, 2010), GRFS (Zhao et al., 2015) and the proposed UFS-RDR algorithm. The Baseline approach uses all features in each experiment. First, a feature subset is selected by each method, and then the K-means clustering algorithm is employed to obtain the clustering outcomes; a sketch of this protocol is given below. All experiments are implemented in R 3.4.0 on a personal computer.
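A minimal R sketch of this select-then-cluster protocol follows. The helper evaluate_subset() is hypothetical, and purity is used as an illustrative score; the paper's exact evaluation metrics may differ.

    # Sketch of the protocol: cluster on a selected feature subset and
    # score the partition against known labels (purity, illustrative only)
    evaluate_subset <- function(X, selected, labels, k) {
      km  <- kmeans(X[, selected, drop = FALSE], centers = k, nstart = 25)
      tab <- table(km$cluster, labels)            # cluster-vs-label table
      sum(apply(tab, 1, max)) / length(labels)    # purity in [0, 1]
    }

    # usage: k-means on two selected features of iris
    evaluate_subset(as.matrix(iris[, 1:4]), c(3, 4), iris$Species, k = 3)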

Conclusions

In this article, we propose an unsupervised feature selection method, namely UFS-RDR, to deal with outliers and noisy data in cluster analysis. UFS-RDR unifies the ideas of a weighted data reconstruction error and graph regularization to robustly preserve the discriminant information and local similarity of the data. Our experiments on synthetic data showed that the advantage of the proposed model in feature selection and clustering accuracy grows with an increasing proportion of outliers. The

CRediT authorship contribution statement

Abdul Wahid: Conceptualization, Methodology, Data curation, Software, Writing – original draft. Dost Muhammad Khan: Data curation, Writing – original draft, Supervision, Writing – reviewing and editing. Ijaz Hussain: Visualization, Investigation. Sajjad Ahmad Khan: Visualization, Validation. Zardad Khan: Software, Validation, Reviewing and editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The authors would like to thank the editor-in-chief, associate editor and two anonymous reviewers for their insightful and helpful comments and suggestions, which resulted in substantial improvements to this work.

References (57)

  • Tabakhi, S., et al. (2014). An unsupervised feature selection algorithm based on ant colony optimization. Engineering Applications of Artificial Intelligence.

  • Tang, C., et al. (2018). Robust unsupervised feature selection via dual self-representation and manifold regularization. Knowledge-Based Systems.

  • Tang, C., et al. (2018). Robust graph regularized unsupervised feature selection. Expert Systems with Applications.

  • Wahid, A., et al. (2020). Feature selection and classification for gene expression data using novel correlation based overlapping score method via Chou’s 5-steps rule. Chemometrics and Intelligent Laboratory Systems.

  • Wang, S., et al. (2017). Unsupervised feature selection via low-rank approximation and structure learning. Knowledge-Based Systems.

  • Wang, J., et al. (2013). Maximum weight and minimum redundancy: a novel framework for feature subset selection. Pattern Recognition.

  • Xiang, S., et al. (2015). Efficient nonconvex sparse group feature selection via continuous and discrete optimization. Artificial Intelligence.

  • Zhang, Z., et al. (2014). Similarity preserving low-rank representation for enhanced data representation and effective subspace learning. Neural Networks.

  • Zhou, N., et al. (2016). Global and local structure preserving sparse subspace learning: An iterative approach to unsupervised feature selection. Pattern Recognition.

  • Zhu, P., et al. (2015). Unsupervised feature selection by regularized self-representation. Pattern Recognition.

  • Apiletti, D., et al. (2012). Maskedpainter: feature selection for microarray data analysis. Intelligent Data Analysis.

  • Belkin, M., et al. Laplacian eigenmaps and spectral techniques for embedding and clustering.

  • Belkin, M., et al. (2006). Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research.

  • Cai, D., Zhang, C., & He, X. (2010). Unsupervised feature selection for multi-cluster data. In Proceedings...

  • Cheng, D., et al. (2017). Feature selection by combining subspace learning with sparse representation. Multimedia Systems.

  • Elhamifar, E., & Vidal, R. (2013). Sparse subspace clustering: Algorithm, theory, and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence.

  • Fan, J., et al. (2014). Challenges of big data analysis. National Science Review.

  • Hall, M. A. (2000). Correlation-based feature selection of discrete and numeric class machine learning.