Neurocomputing

Volume 153, 4 April 2015, Pages 62-76

Sparse semi-supervised support vector machines by DC programming and DCA

https://doi.org/10.1016/j.neucom.2014.11.051

Highlights

  • Two approaches based on DC (Difference of Convex functions) programming and DCA (DC Algorithms) are investigated for feature selection in S3VM using the zero norm.

  • The first is a DC approximation approach that approximates the zero norm by a DC function. Five approximations are considered.

  • The second is an exact reformulation approach based on exact penalty techniques in DC programming.

  • All the resulting problems are DC programs for which six versions of DCA are developed.

  • Careful empirical experiments on several benchmark datasets are performed.

Abstract

This paper studies the problem of feature selection in the context of the Semi-Supervised Support Vector Machine (S3VM). The zero norm, a natural concept for dealing with sparsity, is used for the feature selection purpose. Due to two nonconvex terms (the loss function of the unlabeled data and the ℓ0 term), we are faced with an NP-hard optimization problem. Two continuous approaches based on DC (Difference of Convex functions) programming and DCA (DC Algorithms) are developed. The first is a DC approximation approach that approximates the ℓ0-norm by a DC function. The second is an exact reformulation approach based on exact penalty techniques in DC programming. All the resulting optimization problems are DC programs for which DCA schemes are investigated. Several usual sparsity-inducing functions are considered, and six versions of DCA are developed. Numerical experiments on several benchmark datasets show the efficiency of the proposed algorithms, in both feature selection and classification.

Introduction

In machine learning, supervised learning is the task of inferring a predictor function (classifier) from a labeled training dataset. Each example in the training set consists of an input object and a label. The objective is to build a predictor function which can be used to identify the label of new examples with the highest possible accuracy. Nevertheless, in most real-world applications, a large portion of the training data is unlabeled, and supervised learning cannot be used in these contexts. To deal with this difficulty, semi-supervised learning methods have recently attracted increasing attention in the machine learning community. In contrast to supervised methods, semi-supervised learning methods take into account both labeled and unlabeled data to construct prediction models.

We are interested in semi-supervised classification, more precisely, in the so-called Semi-Supervised Support Vector Machine (S3VM). Among semi-supervised classification methods, the large margin approach S3VM, which extends the Support Vector Machine (SVM) to the semi-supervised setting, is certainly the most popular [5], [6], [7], [8], [9], [10], [11], [12], [13], [37], [39], [41], [47], [54], [59]. An extensive overview of semi-supervised classification methods can be found in [61]. S3VM was originally proposed by Vapnik and Sterin in 1977 [57] under the name of transductive support vector machine. Later, in 1999, Bennett and Demiriz [2] proposed the first optimization formulation of S3VM, which is described as follows.

Given a training set which consists of m labeled points $\{(x_i, y_i) \in \mathbb{R}^n \times \{-1, 1\},\ i = 1, \ldots, m\}$ and p unlabeled points $\{x_i \in \mathbb{R}^n,\ i = m+1, \ldots, m+p\}$, we are to find a separating hyperplane $P = \{x \mid x \in \mathbb{R}^n,\ x^T w = b\}$, far away from both the labeled and unlabeled points. Hence, the optimization problem of S3VM takes the form
$$\min_{w,b}\ \|w\|_2^2 + \alpha \sum_{i=1}^{m} L\big(y_i(\langle w, x_i\rangle + b)\big) + \beta \sum_{i=m+1}^{m+p} L\big(|\langle w, x_i\rangle + b|\big). \qquad (1)$$
Here, the first two terms define a standard SVM while the third one incorporates the loss function of the unlabeled data points. The losses of labeled and unlabeled data points are weighted by penalty parameters $\alpha > 0$ and $\beta > 0$, respectively. Usually, in the classical SVM one uses the hinge loss function $L(u) = \max\{0, 1-u\}$, which is convex. On the contrary, problem (1) is nonconvex, due to the nonconvexity of its third term.
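To make formulation (1) concrete, the following minimal NumPy sketch evaluates the S3VM objective with the hinge loss for both the labeled and unlabeled terms; the function and variable names (s3vm_objective, X_lab, X_unl) are ours, introduced for illustration only.

    import numpy as np

    def hinge(u):
        # Hinge loss L(u) = max(0, 1 - u), applied componentwise.
        return np.maximum(0.0, 1.0 - u)

    def s3vm_objective(w, b, X_lab, y_lab, X_unl, alpha, beta):
        # ||w||_2^2 + alpha * hinge loss on labeled points
        #           + beta * symmetric hinge loss on unlabeled points, as in (1).
        margins_lab = y_lab * (X_lab @ w + b)   # y_i (<w, x_i> + b)
        margins_unl = np.abs(X_unl @ w + b)     # |<w, x_i> + b|
        return (np.dot(w, w)
                + alpha * hinge(margins_lab).sum()
                + beta * hinge(margins_unl).sum())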

There are two broad strategies for solving the optimization problem (1) of S3VM: combinatorial methods (Mixed Integer Programming [2], Branch and Bound algorithm [6]) and continuous optimization methods such as the self-labeling heuristic S3VMlight [15], gradient descent [5], deterministic annealing [53], semi-definite programming [4], and DC programming [9]. Combinatorial methods are not suitable for the massive datasets arising in real applications (high dimension and large sample size). Thus, major efforts have focused on efficient local algorithms. For more complete reviews of S3VM methods, the reader is referred to [7], [61] and references therein.

On the other hand, feature selection is one of the fundamental problems in machine learning. In many areas of application, such as text classification, web mining, gene expression and micro-array analysis, combinatorial chemistry, and image analysis, datasets contain a large number of features, many of which are irrelevant or redundant. Feature selection is often applied to high-dimensional data prior to classification learning. The main goal is to select a subset of features of a given dataset while preserving or improving the discriminative ability of the classifier. Several feature-selection methods for SVMs have been proposed in the literature (see e.g. [3], [11], [17], [19], [29], [30], [33], [40], [46], [49], [62], [63]).

This paper deals with feature selection in the context of S3VM. We are to find a separating hyperplane, far away from both the labeled and unlabeled points, that uses the least number of features. As in standard approaches for feature selection in SVM, here we use the zero norm, a natural concept for dealing with sparsity. The zero norm of a vector, denoted ℓ0 or $\|\cdot\|_0$, is defined as the number of its nonzero components. Similarly to the sparse SVM, we replace the term $\|w\|_2^2$ in (1) by the ℓ0-norm and then formulate the feature selection S3VM problem as follows:
$$\min_{w,b}\ \|w\|_0 + \alpha \sum_{i=1}^{m} L\big(y_i(\langle w, x_i\rangle + b)\big) + \beta \sum_{i=m+1}^{m+p} L\big(|\langle w, x_i\rangle + b|\big). \qquad (2)$$

While S3VM has been widely studied, few methods exist in the literature for feature selection in S3VM. Due to the discontinuity of the ℓ0 term and the nonconvexity of the third term, we face a "double" difficulty in (2) (it is well known that the problem of minimizing the zero norm is NP-hard [1]).

During the last two decades, research on optimization models and methods involving the zero norm has been very active. Works can be divided into three categories according to the way the zero norm is treated: convex approximation (the ℓ0-norm is replaced by a convex function, for instance the ℓ1-norm [55] or the conjugate function [43]), nonconvex approximation (a continuous nonconvex function is used in place of the ℓ0-norm; the usual sparsity-inducing functions are introduced in [3], [11], [32], [45], [63]), and nonconvex exact reformulation (with binary variables $u_i = 0$ if $w_i = 0$ and $u_i = 1$ otherwise, the original problem is formulated as a combinatorial optimization problem, which is equivalently reformulated as a continuous nonconvex program via exact penalty techniques; see [33], [56]). An extensive overview of these approaches can be found in [33]. When the objective function (besides the ℓ0 term) is convex, convex approximation techniques result in a convex optimization problem, which is relatively "easy" to solve. Unfortunately, for S3VM, the problem (2) remains nonconvex under any approximation, convex or nonconvex, of the ℓ0-norm.
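As a point of comparison between these categories, the sketch below evaluates a convex surrogate (the ℓ1-norm) and two common nonconvex surrogates of the zero norm on a weight vector; the parameter a and the exact scalings are illustrative assumptions, as the definitions used in the paper may be parametrized differently.

    import numpy as np

    def l1_norm(w):
        # Convex surrogate: ||w||_1.
        return np.abs(w).sum()

    def piecewise_exponential(w, a=5.0):
        # Piecewise exponential surrogate: sum_i (1 - exp(-a|w_i|));
        # larger a brings the surrogate closer to the 0/1 step of ||.||_0.
        return (1.0 - np.exp(-a * np.abs(w))).sum()

    def capped_l1(w, a=5.0):
        # Piecewise linear (capped-l1) surrogate: sum_i min(1, a|w_i|).
        return np.minimum(1.0, a * np.abs(w)).sum()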

In this work, we tackle the problem (2) by nonconvex approaches based on DC (Difference of Convex functions) programming and DCA (DC Algorithms), powerful tools in the nonconvex programming framework. Our motivating arguments for using the ℓ0-norm are multiple. Firstly, even though the ℓ1-norm is the simplest way to deal with sparsity, it encourages sparsity only in some cases, under restrictive assumptions (see [14]). In particular, for the feature selection purpose, the ℓ1 penalty has been shown to be, in certain cases, inconsistent and biased [60]. Secondly, the ℓ0-norm is the most natural and suitable concept for modeling sparsity, and nonconvex approximations of the ℓ0-norm are, in general, deeper than the ℓ1-norm and can therefore produce better sparsity. In particular, for feature selection in SVM, solutions of the ℓ0-norm penalty problem have been shown to be much sparser than those of the ℓ1-norm approach in several previous works [3], [29], [33], [40]. Thirdly, although we are faced with a "double" difficulty in the problem (2) because of the ℓ0-norm and the third term, the power of DCA can be exploited to efficiently solve this hard problem, knowing that DCA has been successfully developed in a variety of works in Machine Learning (see e.g. [9], [20], [27], [28], [38], [48], [50] and the list of references in [36]), in particular for feature selection in SVM [29], [30], [33], [34], [35], [40], [46].

Paper's contributions: We develop a unified approach based on DC programming and DCA for solving the nonconvex optimization problem (2). The first work in this research direction was published in the conference paper [22], where we considered the piecewise exponential approximation [3] of the ℓ0-norm and presented a DC formulation, its corresponding DCA for solving the resulting problem, as well as some preliminary numerical experiments. In this paper, we carefully explore and exploit DCA based approaches to deal with the zero norm and sparse S3VM. Firstly, several DC approximations of the ℓ0-norm will be applied to (2): the logarithm function [63], Smoothly Clipped Absolute Deviation (SCAD) [11], piecewise exponential [3], DC polyhedral [45] and piecewise linear [32] approximations. Secondly, inspired by the technique in [33], [56], we equivalently formulate (2) as a combinatorial optimization problem. Then, thanks to the new result on exact penalty techniques recently developed in [31], we reformulate the resulting problem as a continuous optimization problem and investigate DCA to solve it. Finally, we provide an empirical evaluation of all proposed algorithms on several benchmark datasets, to study their efficiency in both feature selection and classification.

The remainder of the paper is organized as follows. DC programming and DCA are briefly presented in Section 2 while Section 3 is devoted to the development of DCA for solving the feature selection S3VM problem (2). Computational experiments are reported in Section 4 and finally Section 5 concludes the paper.

Section snippets

Outline of DC programming and DCA

DC programming and DCA constitute the backbone of smooth/nonsmooth nonconvex programming and global optimization. They address the problem of minimizing a function f which is the difference of two convex functions on the whole space $\mathbb{R}^d$ or on a convex set $C \subset \mathbb{R}^d$. Generally speaking, a DC program is an optimization problem of the form
$$\alpha = \inf\{f(x) := g(x) - h(x) : x \in \mathbb{R}^d\} \qquad (P_{dc})$$
where g, h are lower semi-continuous proper convex functions on $\mathbb{R}^d$. Such a function f is called a DC function, and $g - h$ a DC decomposition of f.
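In scheme form, each DCA iteration linearizes the concave part $-h$ at the current point and minimizes the resulting convex majorant. A minimal sketch, assuming two problem-specific callables are supplied: grad_h, returning some $y^k \in \partial h(x^k)$, and solve_subproblem, returning a minimizer of the convex program $\min_x \{g(x) - \langle x, y^k\rangle\}$.

    import numpy as np

    def dca(grad_h, solve_subproblem, x0, max_iter=100, tol=1e-6):
        # Generic DCA for min f(x) = g(x) - h(x).
        x = x0
        for _ in range(max_iter):
            y = grad_h(x)               # y^k in the subdifferential of h at x^k
            x_new = solve_subproblem(y) # x^{k+1} in argmin_x g(x) - <x, y^k>
            # Stop when successive iterates are (relatively) close.
            if np.linalg.norm(x_new - x) <= tol * (1.0 + np.linalg.norm(x)):
                return x_new
            x = x_new
        return x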

Feature selection in S3VM by DC programming and DCA

Assume that the m labeled points and the p unlabeled points are represented by the matrices $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{p \times n}$, respectively. Let D be the $m \times m$ diagonal matrix with $D_{ii} = y_i$, $i = 1, \ldots, m$. Denote by e the vector of ones in the appropriate vector space. For each labeled point $x_i$ ($i = 1, \ldots, m$), we introduce a new variable $\xi_i$ which represents the misclassification error. Similarly, for each unlabeled point $x_i$ ($i = m+1, \ldots, m+p$), we define $r_i$ and $s_i$ for the two possible misclassification errors. Let r and s be vectors in $\mathbb{R}^p$
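With this notation, the slack-variable form of (2) can be sketched as follows, following the standard Bennett-Demiriz construction (the paper's exact constraints and sign conventions may differ slightly):

\begin{align*}
\min_{w,b,\xi,r,s}\quad & \|w\|_0 + \alpha\, e^{T}\xi + \beta\, e^{T}\min\{r,s\}\\
\text{s.t.}\quad & D(Aw + be) + \xi \ge e, \quad \xi \ge 0,\\
& (Bw + be) + r \ge e, \quad r \ge 0,\\
& -(Bw + be) + s \ge e, \quad s \ge 0,
\end{align*}

where the minimum $\min\{r,s\}$ is taken componentwise.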

Datasets

We perform numerical experiments on two groups of datasets. The first one contains several real-world datasets taken from the UCI Machine Learning Repository and the NIPS 2003 Feature Selection Challenge. The second is the CBCL Face Database, taken from the MIT Center for Biological and Computational Learning (http://cbcl.mit.edu/projects/cbcl/software-datasets/).

The CBCL Face Database is considered as an instance of the face detection problem, which arises in numerous application areas including

Conclusion

We have intensively investigated a unified DC programming approach for feature selection in the context of the Semi-Supervised Support Vector Machine. Using five different approximations and the continuous exact reformulation via an exact penalty technique leads us to six DC programs, and we developed six DCA based algorithms for solving the resulting problems. Numerical results on several real datasets showed the robustness and effectiveness of the DCA based schemes. We are convinced that DCA is a

Acknowledgments

The authors acknowledge the support of the "Fonds Européen de Développement Régional" (FEDER) of Lorraine under the project INNOMAD.


References (63)

  • O. Chapelle et al., Optimization techniques for semi-supervised support vector machines, J. Mach. Learn. Res. (2008)
  • R. Collobert et al., Large scale transductive SVMs, J. Mach. Learn. Res. (2006)
  • W. Emara, M.K.M. Karnstedt, K. Sattler, D. Habich, W. Lehner, An approach for incremental semi-supervised SVM, in: ...
  • J. Fan et al., Variable selection via nonconcave penalized likelihood and its oracle properties, J. Am. Stat. Assoc. (2001)
  • G. Fung et al., Semi-supervised support vector machines for unlabeled data classification, Optim. Methods Softw. (2001)
  • F. Gieseke, A. Airola, T. Pahikkala, O. Kramer, Sparse quasi-Newton optimization for semi-supervised support vector ...
  • R. Gribonval et al., Sparse representations in unions of bases, IEEE Trans. Inf. Theory (2003)
  • T. Joachims, Transductive inference for text classification using support vector machines, in: 16th International ...
  • B. Heisele, P. Ho, T. Poggio, Face recognition with support vector machines: global versus component-based approach, ...
  • L. Hermes, J.M. Buhmann, Feature selection for support vector machines, in: Proceedings of the 15th International Conference ...
  • H. Zou, The adaptive lasso and its oracle properties, J. Am. Stat. Assoc. (2006)
  • N. Krause, Y. Singer, Leveraging the margin more carefully, in: Proceedings of ICML '04, NY, USA, 2004, pp. ...
  • I. Kukenys, Human face detection with support vector machines (Thesis), University of Otago, ...
  • H.M. Le, H.A. Le Thi, M.C. Nguyen, DCA based algorithms for feature selection in semi-supervised support vector ...
  • H.A. Le Thi, Analyse numérique des algorithmes de l'Optimisation d.c. Approches locales et globales, Codes et ...
  • H.A. Le Thi, Contribution à l'optimisation non convexe et l'optimisation globale: Théorie, Algorithmes et Applications, ...
  • H.A. Le Thi et al., Solving a class of linearly constrained indefinite quadratic problems by DC algorithms, J. Glob. Optim. (1997)
  • H.A. Le Thi et al., The DC (difference of convex functions) programming and DCA revisited with DC models of real world nonconvex optimization problems, Ann. Oper. Res. (2005)
  • H.A. Le Thi et al., A new efficient algorithm based on DC programming and DCA for clustering, J. Glob. Optim. (2006)
  • H.A. Le Thi et al., Optimization based DC programming and DCA for hierarchical clustering, Eur. J. Oper. Res. (2007)
  • H.A. Le Thi et al., A DC programming approach for feature selection in support vector machines learning, Adv. Data Anal. Classif. (2008)

Hoai Minh Le earned his Ph.D. in Computer Science from the University Paul Verlaine-Metz, France in 2007. From 2008 to 2009, he was a post-doctoral researcher at the University of Le Havre, France. Since 2010, he has been a researcher at the Laboratory of Theoretical & Applied Computer Science, University of Lorraine, France. He recently joined the Modeling and Optimization of Complex Systems Research Group, Ton Duc Thang University, Ho Chi Minh City, Vietnam. His research interests include non-convex optimization in Machine Learning (supervised/unsupervised classification, feature selection...), Scheduling and Cryptography. He is the author/co-author of more than 25 journal articles, international conference papers and book chapters, and is a reviewer for several international journals/conferences.

Hoai An Le Thi obtained her PhD with Distinction in Optimization in 1994 and her Habilitation in 1997, both from the University of Rouen, France. She is currently the Director of the Laboratory of Theoretical & Applied Computer Science, University of Lorraine, and serves as a full Professor of Exceptional Class. She is the author/co-author of more than 180 journal articles, international conference papers and book chapters, and the co-editor of 5 books and 12 special issues of international journals. She has served as president of the Scientific Committee and president of the Organizing Committee, as well as a member of the Scientific Committee, of various international conferences, and has led several regional/national/international projects. Her research interests include Machine Learning, Optimization and Operations Research and their applications in Information Systems and various complex Industrial Systems. She is the co-founder (with Pham Dinh Tao) of DC programming and DCA, an innovative approach in non-convex programming.

Manh Cuong Nguyen has been an assistant professor at Hanoi University of Industry since 2004. He earned his Ph.D. in May 2014 from the University of Lorraine, France. He is currently a postdoctoral researcher in an R&D project at the Laboratory of Theoretical & Applied Computer Science, University of Lorraine. His research interests include issues related to nonconvex optimization applied to data mining and machine learning, such as community detection in large networks, Kohonen maps, feature selection in support vector machines, and recommender systems.
