Knowledge-Based Systems

Volume 107, 1 September 2016, Pages 61-69

Kernel sparse modeling for prototype selection

https://doi.org/10.1016/j.knosys.2016.05.058

Abstract

Recently, a new method termed Sparse Modeling Representative Selection (SMRS) has been proposed for selecting the most relevant instances in datasets. SMRS is based on data self-representativeness in the sense that it estimates a coding matrix using a dictionary of samples set to the data themselves. Sample relevances are derived from the matrix of coefficients under a block-sparsity constraint. Due to the use of a linear model for data self-representation, SMRS cannot always provide truly relevant samples. Moreover, most of the samples selected by SMRS lie in dense regions. In this paper, we propose to overcome these shortcomings of SMRS by deploying non-linear data self-representativeness through two kinds of data projection: the kernel trick and column generation. A qualitative evaluation is performed by summarizing two video movies. Quantitative evaluations are obtained by performing classification tasks on the summarized training image datasets, where the objective is to compare the relevance of the selected samples for a given classification task and a given instance selection method. The conducted experiments show that the proposed methods can outperform state-of-the-art methods, including the SMRS method.

Introduction

Estimating a subset of prototypes, known as representatives, that can efficiently and reliably describe the whole dataset is an important issue in the analysis of scientific data. It has many applications in machine learning, data recovery, image processing, etc. Due to the effectiveness of prototype selection for speeding up training processes, many methods have been proposed [1], [2], [3], [4], [5], [6], [7]. The selected representatives can summarize datasets of images, videos, texts or Web documents. Finding a small number of prototypes that replace the learning database has two main advantages: (i) reducing the memory space needed to store data, and (ii) improving the computation time of classification algorithms. For example, the Nearest Neighbor (NN) classifier is more efficient [8] when comparing test samples to a few representatives rather than to all training samples. A reduced training dataset can also speed up the training process in the sense that the classifier training becomes less computationally expensive. For pattern recognition tasks, it is also required that the overall performance is not considerably affected by the data reduction. The problem can be stated as follows: given a training set T, the goal of a prototype selection method is to obtain a subset S ⊆ T such that S does not contain superfluous prototypes and Acc(S) ≃ Acc(T), where Acc(S) is the classification accuracy obtained using the subset S as training set. Instance selection methods can either start with S = ∅ (incremental methods) or S = T (decremental methods). The difference is that incremental methods add prototypes to S during the selection process, whereas decremental methods remove prototypes from S as the selection proceeds.
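To make the incremental/decremental distinction concrete, the following minimal sketch (our own illustration, not an algorithm from this paper) implements a naive decremental wrapper that uses a 1-NN accuracy criterion on a held-out validation set; the helper names and the tolerance `tol` are hypothetical, and the inputs are assumed to be NumPy arrays.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def acc(S, X_train, y_train, X_val, y_val):
    """Acc(S): accuracy of a 1-NN classifier trained only on the prototypes in S."""
    clf = KNeighborsClassifier(n_neighbors=1).fit(X_train[S], y_train[S])
    return clf.score(X_val, y_val)

def decremental_selection(X_train, y_train, X_val, y_val, tol=0.01):
    """Greedy decremental selection: start with S = T and drop any prototype
    whose removal keeps Acc(S) within `tol` of the full-set accuracy."""
    S = list(range(len(X_train)))
    baseline = acc(S, X_train, y_train, X_val, y_val)
    for i in range(len(X_train)):
        candidate = [j for j in S if j != i]
        if len(set(y_train[candidate])) < len(set(y_train)):
            continue  # never remove the last prototype of a class
        if acc(candidate, X_train, y_train, X_val, y_val) >= baseline - tol:
            S = candidate
    return S
```

An incremental counterpart would instead start from an empty S and add a prototype only when it improves the validation accuracy.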

Standard dictionary learning algorithms (e.g., [9], [10], [11]) compute atoms in a continuous solution space, i.e. they select representatives that mix data samples and almost never coincide with points of the original dataset. Thus, dictionary learning may not be suitable for tasks like video summarization and data sample selection. On the other hand, prototype selection provides a discrete dictionary whose atoms are chosen from the original dataset.

As in feature selection, according to the strategy used for selecting prototypes, we can divide prototype selection methods into two groups: (i) wrapper methods, in which the selection criterion is based on the accuracy obtained by a classifier (commonly, those prototypes that do not contribute to the classification accuracy are discarded from the training set) (e.g., [12], [13]), and (ii) filter methods, in which the selection criterion uses a selection function that is not based on a classifier (e.g., [14]). A good review of wrapper and filter methods can be found in [15]. In [16], the authors proposed a wrapper technique that exploits a new two-layer genetic algorithm based on a divide-and-conquer partition strategy. Many prototype selection algorithms (e.g., [17], [18]) are strongly related to the use of the k-NN classifier. One can also find prototype selection algorithms that are not restricted to a specific classifier. Examples of this kind are the evolutionary algorithms (e.g., [19]), which use the accuracy of a classifier as the selection criterion. In these algorithms, a prototype is deleted whenever it does not contribute to maintaining or improving the classification accuracy. For instance, the work of [19] proposed evolutionary algorithms for selecting prototypes. Memetic algorithms combine evolutionary algorithms with local search: within the evolutionary cycle, a local search (among the chromosomes) is carried out in order to improve the accuracy and reduce the size of the solutions.

Some works address instance selection for regression tasks (e.g., [20]). In [21], the authors proposed boosting instance selection algorithms by considering instance selection as a binary classification problem: selected or unselected. With this view, the philosophy of boosting and of constructing ensembles of instance selectors becomes applicable. Several rounds of an instance selection procedure are performed on different samples drawn from the training set. In [22], the author proposed an ensemble method for instance selection. The framework applies the Edited Nearest Neighbor or the Condensed Nearest Neighbor algorithm over several subsets of features. The final instance selection is then obtained by a voting scheme.

The filter algorithms can be divided into two main groups. The first category finds representatives from data contained in one or several subspaces of reduced dimensionality. For instance, the Rank Revealing QR (RRQR) algorithm [23] tries to select a few data points by finding a permutation of the data that gives the best-conditioned sub-matrix. Greedy and randomized algorithms have also been proposed in order to find a subset of columns in a reduced-rank matrix [1], [3], [24]. In [25], the authors use the multi-label Edited Nearest Neighbor prototype selection algorithm in order to annotate large image datasets using the Kernel Extreme Learning Machine. In [26], the authors propose an instance selection method that is based on data self-representativeness with a block-sparsity constraint. In [27], the authors propose a weighted and recursive variant of the method proposed in [26]. It was shown that recursive instance elimination together with weighted coding can lead to better performance for instance selection.
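As a concrete illustration of this first filter category, the short sketch below uses QR factorization with column pivoting, a common practical stand-in for the Rank Revealing QR approach of [23], to pick a well-conditioned subset of data columns; it is our own minimal example rather than the exact algorithm of [23].

```python
import numpy as np
from scipy.linalg import qr

def rrqr_representatives(Y, k):
    """Select k columns of the data matrix Y (d x N, one sample per column)
    by QR with column pivoting: the first k pivots index a well-conditioned
    column subset that can serve as representatives."""
    _, _, piv = qr(Y, mode='economic', pivoting=True)
    return piv[:k]  # indices of the selected samples

# Example: pick 10 representatives from 200 random 50-dimensional samples
Y = np.random.randn(50, 200)
selected = rrqr_representatives(Y, 10)
```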

The second category finds representatives by assuming there is a natural grouping of the data collection based on an appropriate measure of similarity between pairs of data points [28], [29], [30]. Accordingly, these algorithms generally work on the similarity/dissimilarity between the data points to be grouped. The Kmedoids algorithm [28], which can be considered a variant of Kmeans [31], supposes that the data are located around several class centers, called medoids, which are selected from the data. Another algorithm based on the similarity/dissimilarity of data points is Affinity Propagation (AP) [30]. This algorithm tries to find representatives from the similarities between pairs of data points by using a message passing algorithm. Although AP has suboptimality properties and finds only approximate solutions, it does not require any initialization (unlike Kmeans and Kmedoids) [32].
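For this second category, exemplar selection by Affinity Propagation is readily available in standard libraries. The minimal sketch below (our own example with toy data, not taken from the paper) runs scikit-learn's implementation, which returns the indices of the exemplars it selects without requiring the number of representatives or a random initialization of centers.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

# Toy data: N samples as rows (scikit-learn convention)
X = np.random.randn(300, 20)

# Affinity Propagation selects exemplars (actual data points) via message passing
ap = AffinityPropagation(damping=0.9, random_state=0).fit(X)
exemplar_indices = ap.cluster_centers_indices_   # indices of the selected prototypes
print(len(exemplar_indices), "exemplars selected")
```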

Recently, a new filter method called Sparse Modeling Representative Selection (SMRS) [26] has been proposed to find sample representatives. It is based on expressing every data sample as a linear combination of the whole dataset under a block-sparsity constraint. SMRS essentially relies on the assumption that, for each sample in the dataset, there exist some samples forming a linear subspace that the sample belongs to, or lies very close to. This assumption is very similar to the one used by the Locally Linear Embedding (LLE) technique, where each sample is assumed to be an affine combination of its neighboring samples [33]. LLE is a nonlinear dimension reduction approach that supposes each data point and its neighbors lie on or close to a locally linear patch of the manifold.

In the SMRS method, the whole dataset is used as a dictionary and block sparsity is imposed on the matrix of coding coefficients in order to enhance the relevance of samples. SMRS suffers from at least two shortcomings. First, the linear assumption can be violated. Indeed, real-world data usually have non-linear distributions, such that even in local neighborhoods a linear subspace can be a rough approximation. Second, due to the use of a linear model with large dictionaries, samples belonging to dense regions of the data space will have large coefficients. As a consequence, SMRS tends to select the majority of relevant instances in dense regions. This can be undesirable from a classification point of view, since the presence of samples at class borders enhances the discrimination between different classes.

In this paper, we propose to overcome these shortcomings. We use kernel sparse subspace modeling, where the linear combination is performed in a high dimensional space, in the hope that sample relevance can be better captured in such spaces. In the literature, the kernel trick has been used to obtain non-linear versions of well-known linear classifiers such as Support Vector Machines (SVM). Kernels have also been used to derive non-linear variants of many linear embedding techniques such as Linear Discriminant Analysis (LDA) and Local Discriminant Embedding (LDE). In [34], we proposed a coding scheme that estimates the coding matrix using data self-representativeness. The approach estimates the coding of every sample with respect to the dataset using a two-phase collaborative neighbor coding that imposes implicit and explicit sparsity on the coding coefficients. There are two main differences between [34] and the current work. Firstly, the work in [34] adopts a linear model for data self-representativeness, whereas the current work uses non-linear models induced by kernels. Secondly, the work of [34] uses one independent coding for every sample, which cannot be as efficient as the estimation of one single coding matrix.

The paper is structured as follows: Section 2 presents a brief review of the SMRS method. In Section 3, we describe our proposed kernel methods. Section 4 presents a qualitative evaluation on video summarization and a quantitative evaluation that quantifies the classification performance based on the selected prototypes. Finally, we provide some concluding remarks in Section 5. In the sequel, capital bold letters denote matrices and small bold letters denote vectors.

Section snippets

Review of sparse modeling representative selection (SMRS)

In this section, we briefly describe the Sparse Modeling Representative Selection (SMRS) method proposed in [26]. The problem formulation can be stated as follows. Consider a set of data samples T = {y_1, ..., y_N} in R^d arranged as the columns of the data matrix Y = [y_1, ..., y_N], where d denotes the sample dimension. The objective is to select the most representative samples within the set T. SMRS is a filter method that uses the concept of relevance ranking: the relevance score of each sample is derived from the estimated matrix of coding coefficients.
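As a rough sketch of the row-sparse self-representation program that SMRS [26] builds on (our reading of the formulation; the regularization value, solver choice and helper name are illustrative), one can solve the convex problem below and rank samples by the norms of the rows of the estimated coding matrix.

```python
import numpy as np
import cvxpy as cp

def smrs(Y, lam=5.0):
    """Minimal sketch of an SMRS-style program:
        min_C  0.5 * ||Y - Y C||_F^2 + lam * sum_i ||c^i||_2
        s.t.   columns of C sum to one (affine coding)
    Rows of C with large norm point to representative samples."""
    d, N = Y.shape
    C = cp.Variable((N, N))
    row_sparsity = cp.sum(cp.norm(C, 2, axis=1))          # L1/L2 block-sparsity term
    objective = 0.5 * cp.sum_squares(Y - Y @ C) + lam * row_sparsity
    constraints = [cp.sum(C, axis=0) == 1]
    cp.Problem(cp.Minimize(objective), constraints).solve()
    relevance = np.linalg.norm(C.value, axis=1)            # relevance score per sample
    return np.argsort(-relevance)                          # samples ranked by relevance

Y = np.random.randn(30, 80)          # 80 samples in R^30 as columns
representatives = smrs(Y, lam=2.0)[:10]
```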

Proposed kernel sparse modeling

In this section, we introduce two kernelized sparse modeling selection schemes. The first one adopts a projection onto a Hilbert space [36]. The second uses the column generation trick. In fact, in several important computer vision problems such as face recognition and activity recognition, the data can be well approximated by a union of subspaces. It is important to notice that these subspaces are observed in the ambient space, in which the data distribution can be highly non-linear.
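The following sketch illustrates the kernel-trick idea described here; it is our own illustration under the assumption of an RBF kernel, not the authors' exact algorithm. The linear reconstruction term ||Y - YC||_F^2 is replaced by its Hilbert-space counterpart ||Φ(Y) - Φ(Y)C||_F^2, which can be evaluated from the Gram matrix K alone because it equals tr((I - C)^T K (I - C)) = ||K^{1/2}(I - C)||_F^2.

```python
import numpy as np
import cvxpy as cp
from scipy.linalg import sqrtm
from sklearn.metrics.pairwise import rbf_kernel

def kernel_smrs(Y, lam=2.0, gamma=0.5):
    """Hedged sketch of a kernelized self-representation: the reconstruction
    is carried out in feature space through the Gram matrix K only."""
    N = Y.shape[1]
    K = rbf_kernel(Y.T, gamma=gamma)                  # samples are columns of Y
    K_half = np.real(sqrtm(K + 1e-8 * np.eye(N)))     # symmetric PSD square root
    C = cp.Variable((N, N))
    objective = (0.5 * cp.sum_squares(K_half @ (np.eye(N) - C))
                 + lam * cp.sum(cp.norm(C, 2, axis=1)))
    constraints = [cp.sum(C, axis=0) == 1]
    cp.Problem(cp.Minimize(objective), constraints).solve()
    relevance = np.linalg.norm(C.value, axis=1)
    return np.argsort(-relevance)                      # samples ranked by relevance
```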

Performance evaluation

In this section, we first provide a qualitative evaluation of the SMRS method and the proposed method when applied to video sequence summarization. We then provide a quantitative comparison based on classification results over four benchmark image datasets, obtained after selecting training representatives using different competing selection methods and different classifiers.
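For reference, the quantitative protocol amounts to comparing Acc(S) with Acc(T) on a held-out test set. The hypothetical helper below sketches this comparison with a 1-NN classifier; any of the classifiers used in the experiments could be substituted.

```python
from sklearn.neighbors import KNeighborsClassifier

def compare_acc(X_train, y_train, X_test, y_test, selected):
    """Train the same classifier on the full training set T and on the
    selected subset S, then compare test accuracies (Acc(T), Acc(S))."""
    knn_full = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
    knn_sub = KNeighborsClassifier(n_neighbors=1).fit(X_train[selected], y_train[selected])
    return knn_full.score(X_test, y_test), knn_sub.score(X_test, y_test)
```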

Discussions and conclusions

We proposed a Kernel Sparse Modeling Representation method for finding representatives in a given set of data samples. We compared our proposed methods with several competing methods used in the domain. Experimental results on several public databases and a video movie are presented to demonstrate the efficacy of the proposed approaches. The databases correspond to images, which makes the selection and classification more challenging. The classification results, obtained with three different classifiers, show that the proposed methods can outperform state-of-the-art selection methods, including SMRS.

References (44)

  • J. Calvo-Zaragoza et al.

    Improving kNN multi-label classification in prototype selection scenarios using class proposals

    Pattern Recognit.

    (2015)
  • S. Garcia et al.

    Prototype selection for nearest neighbor classification: Taxonomy and empirical study

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2012)
  • M. Aharon et al.

K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation

    IEEE Trans. Signal Process.

    (2006)
  • Z. Jiang et al.

Learning a discriminative dictionary for sparse coding via label consistent K-SVD

    IEEE Conference on Computer Vision and Pattern Recognition

    (2011)
  • V.M. Patel et al.

    Dictionaries for image-based recognition

    Information Theory and Applications Workshop

    (2013)
  • I. Czarnowski

    Cluster-based instance selection for machine classification

    Knowl. Inf. Syst.

    (2010)
  • J. Chen et al.

    Fast instance selection for speeding up support vector machines

    Knowl.-Based Syst.

    (2013)
  • B. Narayan et al.

    Maxdiff kd-trees for data condensation

    Pattern Recognit. Lett.

    (2006)
  • J.A. Olvera-Lopez et al.

    A review of instance selection methods

    Artif. Intell. Rev.

    (2010)
  • J. Li et al.

Prototype selection based on multi-objective optimisation and partition strategy

    Int. J. Sensor Netw.

    (2015)
  • C.-H. Chou et al.

    The generalized condensed nearest neighbor rule as a data reduction method

    IEEE International Conference on Pattern Recognition

    (2006)
  • F. Vazquez et al.

A stochastic approach to Wilson's editing algorithm

    IbPRIA

    (2005)