Knowledge-Based Systems

Volume 107, 1 September 2016, Pages 61-69

Kernel sparse modeling for prototype selection

https://doi.org/10.1016/j.knosys.2016.05.058

Abstract

Recently, a new method termed Sparse Modeling Representative Selection (SMRS) has been proposed for selecting the most relevant instances in datasets. SMRS is based on data self-representativeness in the sense that it estimates a coding matrix using a dictionary of samples set to the data themselves. Sample relevances are derived from the matrix of coefficients under a block-sparsity constraint. Due to the use of a linear model for data self-representation, SMRS cannot always provide truly relevant samples. Moreover, most of the samples selected by SMRS lie in dense regions. In this paper, we propose to overcome these shortcomings of SMRS by deploying non-linear data self-representativeness through two kinds of data projection: the kernel trick and column generation. A qualitative evaluation is performed by summarizing two video movies. Quantitative evaluations are obtained by performing classification tasks on the summarized training image datasets, where the objective is to compare the relevance of the selected samples for a given classification task and a given instance selection method. The conducted experiments show that the proposed methods can outperform state-of-the-art methods, including the SMRS method.

Introduction

Estimating a subset of prototypes, known as representatives, that can efficiently and reliably describe the whole dataset is an important issue in the analysis of scientific data. It has many applications in machine learning, data recovery, image processing, etc. Due to the effectiveness of prototype selection for speeding up training processes, many methods have been proposed [1], [2], [3], [4], [5], [6], [7]. The selected representatives can summarize datasets of images, videos, texts or Web documents. Finding a small number of prototypes that replace the learning database has two main advantages: (i) reducing the memory space needed to store data, and (ii) improving the computation time of classification algorithms. For example, the Nearest Neighbor (NN) classifier is more efficient [8] when comparing test samples to a few representatives rather than to all training samples. A reduced training dataset can also speed up the training process in the sense that the classifier training becomes less computationally expensive. For pattern recognition tasks, it is also required that the overall performance is not considerably affected by the data reduction. The problem can be stated as follows: given a training set T, the goal of a prototype selection method is to obtain a subset S ⊆ T such that S does not contain superfluous prototypes and Acc(S) ≃ Acc(T), where Acc(S) is the classification accuracy obtained using the subset S as training set. Instance selection methods can either start with S = ∅ (incremental methods) or S = T (decremental methods). The difference is that incremental methods add prototypes to S during the selection process, whereas decremental methods remove prototypes from S as the selection proceeds.
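To make the incremental/decremental distinction concrete, the following minimal sketch (our own illustration, not an algorithm from this paper) implements a naive decremental wrapper that uses a 1-NN accuracy criterion on a held-out validation set; the helper names and the tolerance `tol` are hypothetical, and the inputs are assumed to be NumPy arrays.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def acc(S, X_train, y_train, X_val, y_val):
    """Acc(S): accuracy of a 1-NN classifier trained only on the prototypes in S."""
    clf = KNeighborsClassifier(n_neighbors=1).fit(X_train[S], y_train[S])
    return clf.score(X_val, y_val)

def decremental_selection(X_train, y_train, X_val, y_val, tol=0.01):
    """Greedy decremental selection: start with S = T and drop any prototype
    whose removal keeps Acc(S) within `tol` of the full-set accuracy."""
    S = list(range(len(X_train)))
    baseline = acc(S, X_train, y_train, X_val, y_val)
    for i in range(len(X_train)):
        candidate = [j for j in S if j != i]
        if len(set(y_train[candidate])) < len(set(y_train)):
            continue  # never remove the last prototype of a class
        if acc(candidate, X_train, y_train, X_val, y_val) >= baseline - tol:
            S = candidate
    return S
```

An incremental counterpart would instead start from an empty S and add a prototype only when it improves the validation accuracy.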

Standard dictionary learning algorithms (e.g., [9], [10], [11]) compute atoms in a continuous solution space, i.e. they select representatives that mix data samples and almost never coincide with points of the original dataset. Thus, dictionary learning may not be suitable for tasks like video summarization and data sample selection. On the other hand, prototype selection provides a discrete dictionary whose atoms are chosen from the original dataset.

As in feature selection, according to the strategy used for selecting prototypes, we can divide prototype selection methods into two groups: (i) wrapper methods, in which the selection criterion is based on the accuracy obtained by a classifier (commonly, those prototypes that do not contribute to the classification accuracy are discarded from the training set) (e.g., [12], [13]), and (ii) filter methods, in which the selection criterion uses a selection function that is not based on a classifier (e.g., [14]). A good review of wrapper and filter methods can be found in [15]. In [16], the authors proposed a wrapper technique that exploits a new two-layer genetic algorithm based on a divide-and-conquer partition strategy. Many prototype selection algorithms (e.g., [17], [18]) are strongly related to the use of the k-NN classifier. One can also find prototype selection algorithms that are not restricted to a specific classifier. Examples of this kind are the evolutionary algorithms (e.g., [19]), which use the accuracy of a classifier as the selection criterion. In these algorithms, a prototype is deleted whenever it does not contribute to maintaining or improving the classification accuracy. For instance, the work of [19] proposed evolutionary algorithms for selecting prototypes. Memetic algorithms combine evolutionary algorithms with local search: within the evolutionary cycle, a local search (among the chromosomes) is carried out in order to improve the accuracy and reduce the size of the solutions.

Some works address instance selection for regression tasks (e.g., [20]). In [21], the authors proposed boosting instance selection algorithms by considering instance selection as a binary classification problem: selected or unselected. With this view, the philosophy of boosting and of constructing ensembles of instance selectors becomes applicable. Several rounds of an instance selection procedure are performed on different samples drawn from the training set. In [22], the author proposed an ensemble method for instance selection. The framework applies the Edited Nearest Neighbor or the Condensed Nearest Neighbor algorithm over several subsets of features. The final instance selection is then obtained by a voting scheme.

The filter algorithms can be divided into two main groups. The first category finds representatives from data contained in one or several subspaces of reduced dimensionality. For instance, the Rank Revealing QR (RRQR) algorithm [23] tries to select a few data points by finding a permutation of the data that gives the best-conditioned sub-matrix. Greedy and randomized algorithms have also been proposed in order to find a subset of columns in a reduced-rank matrix [1], [3], [24]. In [25], the authors use the multi-label Edited Nearest Neighbor prototype selection algorithm in order to annotate large image datasets using the Kernel Extreme Learning Machine. In [26], the authors propose an instance selection method that is based on data self-representativeness with a block-sparsity constraint. In [27], the authors propose a weighted and recursive variant of the method proposed in [26]. It was shown that recursive instance elimination together with weighted coding can lead to better performance for instance selection.
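As a concrete illustration of this first filter category, the short sketch below uses QR factorization with column pivoting, a common practical stand-in for the Rank Revealing QR approach of [23], to pick a well-conditioned subset of data columns; it is our own minimal example rather than the exact algorithm of [23].

```python
import numpy as np
from scipy.linalg import qr

def rrqr_representatives(Y, k):
    """Select k columns of the data matrix Y (d x N, one sample per column)
    by QR with column pivoting: the first k pivots index a well-conditioned
    column subset that can serve as representatives."""
    _, _, piv = qr(Y, mode='economic', pivoting=True)
    return piv[:k]  # indices of the selected samples

# Example: pick 10 representatives from 200 random 50-dimensional samples
Y = np.random.randn(50, 200)
selected = rrqr_representatives(Y, 10)
```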

The second category finds representatives by assuming there is a natural grouping of the data collection based on an appropriate measure of similarity between pairs of data points [28], [29], [30]. Accordingly, these algorithms generally work on the similarity/dissimilarity between the data points to be grouped. The Kmedoids algorithm [28], which can be considered a variant of Kmeans [31], supposes that the data are located around several class centers, called medoids, which are selected from the data. Another algorithm based on the similarity/dissimilarity of data points is Affinity Propagation (AP) [30]. This algorithm tries to find representatives from the similarities between pairs of data points by using a message passing algorithm. Although AP has suboptimality properties and finds only approximate solutions, it does not require any initialization (unlike Kmeans and Kmedoids) [32].
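For this second category, exemplar selection by Affinity Propagation is readily available in standard libraries. The minimal sketch below (our own example with toy data, not taken from the paper) runs scikit-learn's implementation, which returns the indices of the exemplars it selects without requiring the number of representatives or a random initialization of centers.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

# Toy data: N samples as rows (scikit-learn convention)
X = np.random.randn(300, 20)

# Affinity Propagation selects exemplars (actual data points) via message passing
ap = AffinityPropagation(damping=0.9, random_state=0).fit(X)
exemplar_indices = ap.cluster_centers_indices_   # indices of the selected prototypes
print(len(exemplar_indices), "exemplars selected")
```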

Recently, a new filter method called Sparse Modeling Representative Selection (SMRS) [26] has been proposed to find sample representatives. It is based on expressing every data sample as a linear combination of the whole dataset under a block-sparsity constraint. SMRS essentially relies on the assumption that, for each sample in the dataset, there exist some samples forming a linear subspace that the sample belongs to, or lies very close to. This assumption is very similar to the one used by the Locally Linear Embedding (LLE) technique, where each sample is assumed to be an affine combination of its neighboring samples [33]. LLE is a nonlinear dimension reduction approach that supposes each data point and its neighbors lie on or close to a locally linear patch of the manifold.

In the SMRS method, the whole dataset is used as a dictionary and block sparsity is imposed on the matrix of coding coefficients in order to enhance the relevance of samples. SMRS suffers from at least two shortcomings. First, the linear assumption can be violated. Indeed, real-world data usually have non-linear distributions, such that even in local neighborhoods a linear subspace can be a rough approximation. Second, due to the use of a linear model with large dictionaries, samples belonging to dense regions of the data space will have large coefficients. As a consequence, SMRS tends to select the majority of relevant instances in dense regions. This can be undesirable from a classification point of view, since the presence of samples at class borders enhances the discrimination between different classes.

In this paper, we propose to overcome these shortcomings. We use kernel sparse subspace modeling, where the linear combination is performed in a high dimensional space, in the hope that sample relevance can be better captured in such spaces. In the literature, the kernel trick has been used to obtain non-linear versions of well-known linear classifiers such as Support Vector Machines (SVM). Kernels have also been used to derive non-linear variants of many linear embedding techniques such as Linear Discriminant Analysis (LDA) and Local Discriminant Embedding (LDE). In [34], we proposed a coding scheme that estimates the coding matrix using data self-representativeness. The approach estimates the coding of every sample with respect to the dataset using a two-phase collaborative neighbor coding that imposes implicit and explicit sparsity on the coding coefficients. There are two main differences between [34] and the current work. Firstly, the work in [34] adopts a linear model for data self-representativeness, whereas the current work uses non-linear models induced by kernels. Secondly, the work of [34] uses one independent coding for every sample, which cannot be as efficient as the estimation of one single coding matrix.

The paper is structured as follows: Section 2 presents a brief review of the SMRS method. In Section 3, we describe our proposed kernel methods. Section 4 presents a qualitative evaluation on video summarization and a quantitative evaluation that quantifies the classification performance based on the selected prototypes. Finally, we provide some concluding remarks in Section 5. In the sequel, capital bold letters denote matrices and small bold letters denote vectors.

Section snippets

Review of sparse modeling representative selection (SMRS)

In this section, we briefly describe the Sparse Modeling Representative Selection (SMRS) method proposed in [26]. The problem formulation can be stated as follows. Consider a set of data samples T = {y_1, ..., y_N} in R^d arranged as the columns of the data matrix Y = [y_1, ..., y_N], where d denotes the sample dimension. The objective is to select the most representative samples within the set T. SMRS is a filter method that uses the concept of relevance ranking: the relevance score of each sample is derived from the estimated matrix of coding coefficients.
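As a rough sketch of the row-sparse self-representation program that SMRS [26] builds on (our reading of the formulation; the regularization value, solver choice and helper name are illustrative), one can solve the convex problem below and rank samples by the norms of the rows of the estimated coding matrix.

```python
import numpy as np
import cvxpy as cp

def smrs(Y, lam=5.0):
    """Minimal sketch of an SMRS-style program:
        min_C  0.5 * ||Y - Y C||_F^2 + lam * sum_i ||c^i||_2
        s.t.   columns of C sum to one (affine coding)
    Rows of C with large norm point to representative samples."""
    d, N = Y.shape
    C = cp.Variable((N, N))
    row_sparsity = cp.sum(cp.norm(C, 2, axis=1))          # L1/L2 block-sparsity term
    objective = 0.5 * cp.sum_squares(Y - Y @ C) + lam * row_sparsity
    constraints = [cp.sum(C, axis=0) == 1]
    cp.Problem(cp.Minimize(objective), constraints).solve()
    relevance = np.linalg.norm(C.value, axis=1)            # relevance score per sample
    return np.argsort(-relevance)                          # samples ranked by relevance

Y = np.random.randn(30, 80)          # 80 samples in R^30 as columns
representatives = smrs(Y, lam=2.0)[:10]
```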

Proposed kernel sparse modeling

In this section, we introduce two kernelized sparse modeling selection schemes. The first one adopts a projection onto a Hilbert space [36]. The second uses the column generation trick. In fact, in several important computer vision problems such as face recognition and activity recognition, the data can be well approximated by a union of subspaces. It is important to notice that these subspaces are observed in the ambient space, in which the data distribution can be highly non-linear.
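The following sketch illustrates the kernel-trick idea described here; it is our own illustration under the assumption of an RBF kernel, not the authors' exact algorithm. The linear reconstruction term ||Y - YC||_F^2 is replaced by its Hilbert-space counterpart ||Φ(Y) - Φ(Y)C||_F^2, which can be evaluated from the Gram matrix K alone because it equals tr((I - C)^T K (I - C)) = ||K^{1/2}(I - C)||_F^2.

```python
import numpy as np
import cvxpy as cp
from scipy.linalg import sqrtm
from sklearn.metrics.pairwise import rbf_kernel

def kernel_smrs(Y, lam=2.0, gamma=0.5):
    """Hedged sketch of a kernelized self-representation: the reconstruction
    is carried out in feature space through the Gram matrix K only."""
    N = Y.shape[1]
    K = rbf_kernel(Y.T, gamma=gamma)                  # samples are columns of Y
    K_half = np.real(sqrtm(K + 1e-8 * np.eye(N)))     # symmetric PSD square root
    C = cp.Variable((N, N))
    objective = (0.5 * cp.sum_squares(K_half @ (np.eye(N) - C))
                 + lam * cp.sum(cp.norm(C, 2, axis=1)))
    constraints = [cp.sum(C, axis=0) == 1]
    cp.Problem(cp.Minimize(objective), constraints).solve()
    relevance = np.linalg.norm(C.value, axis=1)
    return np.argsort(-relevance)                      # samples ranked by relevance
```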

Performance evaluation

In this section, we first provide a qualitative evaluation of the SMRS method and the proposed method when applied to video sequence summarization. We then provide a quantitative comparison based on classification results over four benchmark image datasets, obtained after selecting training representatives using different competing selection methods and different classifiers.
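For reference, the quantitative protocol amounts to comparing Acc(S) with Acc(T) on a held-out test set. The hypothetical helper below sketches this comparison with a 1-NN classifier; any of the classifiers used in the experiments could be substituted.

```python
from sklearn.neighbors import KNeighborsClassifier

def compare_acc(X_train, y_train, X_test, y_test, selected):
    """Train the same classifier on the full training set T and on the
    selected subset S, then compare test accuracies (Acc(T), Acc(S))."""
    knn_full = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
    knn_sub = KNeighborsClassifier(n_neighbors=1).fit(X_train[selected], y_train[selected])
    return knn_full.score(X_test, y_test), knn_sub.score(X_test, y_test)
```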

Discussions and conclusions

We proposed a Kernel Sparse Modeling Representation method for finding representatives in a given set of data samples. We compared our proposed methods with several competing methods used in the domain. Experimental results on several public databases and a video movie are presented to demonstrate the efficacy of the proposed approaches. The databases correspond to images, which makes the selection and classification more challenging. The classification results, obtained with three different classifiers, show that the proposed methods can outperform state-of-the-art selection methods, including SMRS.

References (44)

  • J. Calvo-Zaragoza et al.

    Improving kNN multi-label classification in prototype selection scenarios using class proposals

    Pattern Recognit.

    (2015)
  • S. Garcia et al.

    Prototype selection for nearest neighbor classification: Taxonomy and empirical study

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2012)
  • M. Aharon et al.

K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation

    IEEE Trans. Signal Process.

    (2006)
  • Z. Jiang et al.

Learning a discriminative dictionary for sparse coding via label consistent K-SVD

    IEEE Conference on Computer Vision and Pattern Recognition

    (2011)
  • V.M. Patel et al.

    Dictionaries for image-based recognition

    Information Theory and Applications Workshop

    (2013)
  • I. Czarnowski

    Cluster-based instance selection for machine classification

    Knowl. Inf. Syst.

    (2010)
  • J. Chen et al.

    Fast instance selection for speeding up support vector machines

    Knowl.-Based Syst.

    (2013)
  • B. Narayan et al.

    Maxdiff kd-trees for data condensation

    Pattern Recognit. Lett.

    (2006)
  • J.A. Olvera-Lopez et al.

    A review of instance selection methods

    Artif. Intell. Rev.

    (2010)
  • J. Li et al.

Prototype selection based on multi-objective optimisation and partition strategy

    Int. J. Sensor Netw.

    (2015)
  • C.-H. Chou et al.

    The generalized condensed nearest neighbor rule as a data reduction method

    IEEE International Conference on Pattern Recognition

    (2006)
  • F. Vazquez et al.

A stochastic approach to Wilson's editing algorithm

    IbPRIA

    (2005)