Neurocomputing

Volume 71, Issues 4–6, January 2008, Pages 611–619

Support vector machine classification for large data sets via minimum enclosing ball clustering

https://doi.org/10.1016/j.neucom.2007.07.028

Abstract

Support vector machine (SVM) is a powerful technique for data classification. Despite its sound theoretical foundations and high classification accuracy, the standard SVM is not suitable for classifying large data sets, because its training complexity depends heavily on the size of the data set. This paper presents a novel SVM classification approach for large data sets that uses minimum enclosing ball clustering. After the training data are partitioned by the proposed clustering method, the cluster centers are used for a first-stage SVM classification. The clusters whose centers are support vectors, together with the clusters that contain both classes, are then used for a second-stage SVM classification; at this stage most of the data have been removed. Several experimental results show that the proposed approach achieves classification accuracy close to that of the classic SVM, while training is significantly faster than with several other SVM classifiers.

Introduction

There are a number of standard classification techniques in the literature, such as simple rule-based and nearest-neighbor classifiers, Bayesian classifiers, artificial neural networks, decision trees, the support vector machine (SVM), ensemble methods, etc. Among these techniques, SVM is one of the best known for the optimality of its solution [10], [20], [29]. Recently, many new SVM classifiers have been reported. A geometric approach to SVM classification was given in [21], and a fuzzy neural network SVM classifier was studied in [19]. Despite its sound theoretical foundations and good generalization performance, SVM is not suitable for the classification of large data sets, since it must solve a quadratic programming (QP) problem in order to find a separating hyperplane, which entails intensive computational complexity.

Many researchers have tried to find ways to apply SVM classification to large data sets. Generally, these methods fall into two types: (1) modify the SVM algorithm so that it can be applied to large data sets, and (2) select representative training data from the large data set so that a standard SVM can handle them.

For the first type, a standard projected conjugate gradient (PCG) chunking algorithm scales somewhere between linearly and cubically in the training set size [9], [16]. Sequential minimal optimization (SMO) is a fast method to train SVM [24], [8]. Training an SVM requires solving a QP optimization problem; SMO breaks this large QP problem into a series of smallest possible QP problems, each solved analytically, and it is faster than PCG chunking. Dong et al. [11] introduced a parallel optimization step in which block diagonal matrices are used to approximate the original kernel matrix, so that SVM classification can be split into hundreds of subproblems. A recursive and computationally superior mechanism referred to as adaptive recursive partitioning was proposed in [17], where the data are recursively subdivided into smaller subsets. Genetic programming is able to deal with large data sets that do not fit in main memory [12]. Neural network techniques can also be applied to SVM to simplify the training process [15].
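
Since SMO is also the solver used later in this paper, a compact sketch helps fix ideas. The following implements the simplified SMO variant for a linear kernel; it is a didactic reconstruction, and the function name, parameters, and random second-index heuristic are ours, not Platt's full working-set heuristics nor the authors' implementation:

```python
import numpy as np

def simplified_smo(X, y, C=1.0, tol=1e-3, max_passes=5):
    """Didactic SMO for a linear kernel: optimize one pair of alphas
    analytically at a time instead of solving the full QP at once."""
    n = X.shape[0]
    alpha, b = np.zeros(n), 0.0
    K = X @ X.T                                   # linear kernel matrix
    passes = 0
    while passes < max_passes:
        changed = 0
        for i in range(n):
            E_i = (alpha * y) @ K[:, i] + b - y[i]
            if (y[i] * E_i < -tol and alpha[i] < C) or (y[i] * E_i > tol and alpha[i] > 0):
                j = i
                while j == i:                     # pick a second index at random
                    j = np.random.randint(n)
                E_j = (alpha * y) @ K[:, j] + b - y[j]
                a_i, a_j = alpha[i], alpha[j]
                if y[i] != y[j]:                  # box constraints for the pair
                    L, H = max(0, a_j - a_i), min(C, C + a_j - a_i)
                else:
                    L, H = max(0, a_i + a_j - C), min(C, a_i + a_j)
                eta = 2 * K[i, j] - K[i, i] - K[j, j]
                if L == H or eta >= 0:
                    continue
                alpha[j] = np.clip(a_j - y[j] * (E_i - E_j) / eta, L, H)
                if abs(alpha[j] - a_j) < 1e-5:
                    continue
                alpha[i] = a_i + y[i] * y[j] * (a_j - alpha[j])
                b1 = b - E_i - y[i] * (alpha[i] - a_i) * K[i, i] - y[j] * (alpha[j] - a_j) * K[i, j]
                b2 = b - E_j - y[i] * (alpha[i] - a_i) * K[i, j] - y[j] * (alpha[j] - a_j) * K[j, j]
                b = b1 if 0 < alpha[i] < C else (b2 if 0 < alpha[j] < C else (b1 + b2) / 2)
                changed += 1
        passes = passes + 1 if changed == 0 else 0
    w = (alpha * y) @ X                           # recover the primal weights
    return w, b, alpha
```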

For the second type, clustering has proved to be an effective way to collaborate with SVM in classifying large data sets, for example, hierarchical clustering [31], [1], k-means clustering [5] and parallel clustering [8]. Clustering-based methods can reduce the computational burden of SVM; however, the clustering algorithms themselves are still complicated for large data sets. Rocchio bundling is a statistics-based data reduction method [26]. The Bayesian committee machine has also been reported to train SVM on large data sets: the large data set is divided into m subsets of the same size, and m models are derived from the individual sets [27]. However, it has a higher error rate than the standard SVM, and the sparse property does not hold.

In this paper, a new approach to reducing the training data set is proposed, based on minimum enclosing ball (MEB) clustering. The MEB is the smallest ball that contains all the points in a given set. Our method uses the core-set idea [18], [3] to partition the input data set into several balls, which we call k-balls clustering. In normal clustering the number of clusters may have to be predefined, since determining the optimal number of clusters may involve more computational cost than the clustering itself. The method of this paper does not need the optimal number of clusters; we only need to partition the training data set and extract support vectors with SMO. We then remove the balls whose centers are not support vectors. For the remaining balls, we apply a de-clustering technique and classify their elements with SMO again to obtain the final support vectors. The experimental results show that the accuracy obtained by our approach is very close to that of classic SVM methods, while the training time is significantly shorter. The proposed approach can therefore classify huge data sets with high accuracy.
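
To make the overall flow concrete, the sketch below wires the stages together, with scikit-learn's linear-kernel SVC standing in for SMO and `meb_partition` as a placeholder for the MEB clustering described in the next section; all names here are ours, not the paper's:

```python
import numpy as np
from sklearn.svm import SVC

def two_stage_svm(X, y, meb_partition):
    """Sketch of the two-stage scheme. `meb_partition` must return a
    list of index arrays, one per ball, covering all rows of X."""
    balls = meb_partition(X)
    # Stage 1: one labeled representative per ball (the mean stands in
    # for the MEB center here), classified with a linear SVM.
    centers = np.array([X[b].mean(axis=0) for b in balls])
    labels = np.array([1 if y[b].mean() >= 0 else -1 for b in balls])
    svm1 = SVC(kernel="linear").fit(centers, labels)
    # Keep balls whose centers are support vectors, plus balls that
    # mix both classes, then de-cluster them back to raw points.
    keep = set(svm1.support_) | {i for i, b in enumerate(balls)
                                 if len(np.unique(y[b])) > 1}
    idx = np.concatenate([balls[i] for i in sorted(keep)])
    # Stage 2: final SVM on the surviving (much smaller) data set.
    return SVC(kernel="linear").fit(X[idx], y[idx])
```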


MEB clustering algorithm

The MEB clustering proposed in this paper uses the concept of core-sets, defined as follows.

Definition 1

The ball with center $c$ and radius $r$ is denoted by $B(c,r)$.

Definition 2

Given a set of points $S=\{x_1,\ldots,x_m\}$ with $x_i\in\mathbb{R}^d$, the MEB of $S$ is the smallest ball that contains all the balls and all the points in $S$; it is denoted by $\mathrm{MEB}(S)$.

Because it is very difficult to find the optimal ball $\mathrm{MEB}(S)$, we use an approximation, defined as follows.

Definition 3

A $(1+\varepsilon)$-approximation of $\mathrm{MEB}(S)$, with $\varepsilon>0$, is a ball $B(c,(1+\varepsilon)r)$ such that $r\le r_{\mathrm{MEB}(S)}$ and $S\subset B(c,(1+\varepsilon)r)$.
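
For concreteness, here is a minimal sketch of the classic core-set iteration that yields such a $(1+\varepsilon)$-approximation [3]: start from an arbitrary point and repeatedly shift the center toward the current farthest point. The function name and the NumPy formulation are ours:

```python
import numpy as np

def approx_meb(S, eps=0.1):
    """(1+eps)-approximate minimum enclosing ball via the core-set
    iteration: O(1/eps^2) passes over the points."""
    c = S[0].astype(float).copy()            # start at an arbitrary point
    for i in range(1, int(np.ceil(1.0 / eps**2)) + 1):
        far = S[np.argmax(np.linalg.norm(S - c, axis=1))]  # farthest point
        c += (far - c) / (i + 1)             # step the center toward it
    r = np.linalg.norm(S - c, axis=1).max()  # enclosing radius at this center
    return c, r

# usage: points drawn in the unit square
pts = np.random.rand(1000, 2)
center, radius = approx_meb(pts, eps=0.05)
```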

SVM classification via MEB clustering

Let $(X,Y)$ be the training pattern set, $X=\{x_1,\ldots,x_n\}$, $Y=\{y_1,\ldots,y_n\}$, $y_i=\pm 1$, $x_i=(x_{i1},\ldots,x_{ip})^T\in\mathbb{R}^p$. The training task of SVM classification is to find, from the input $X$ and the output $Y$, the optimal hyperplane that maximizes the margin between the classes. By the sparse property of SVM, the data that are not support vectors do not contribute to the optimal hyperplane. The input data that are far away from the decision hyperplane should be eliminated, while the data that are possibly support vectors should be kept.
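
This elimination is justified by a property that is easy to verify numerically: retraining on the support vectors alone leaves the hyperplane unchanged. A minimal check, assuming scikit-learn's SVC (our illustration, not the paper's code):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=500, centers=2, random_state=0)
y = np.where(y == 0, -1, 1)

full = SVC(kernel="linear", C=10.0).fit(X, y)
sv = full.support_                                   # indices of the support vectors
reduced = SVC(kernel="linear", C=10.0).fit(X[sv], y[sv])

# non-support vectors are inactive constraints, so both fits give
# the same hyperplane (up to numerical tolerance)
print(full.coef_, full.intercept_)
print(reduced.coef_, reduced.intercept_)
```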

Memory space

In the first clustering step, the total input data set $X=\{x_1,\ldots,x_n\}$, $Y=\{y_1,\ldots,y_n\}$, $y_i=\pm 1$, $x_i=(x_{i1},\ldots,x_{ip})^T\in\mathbb{R}^p$ is loaded into memory. The data type is float, so each datum occupies 4 bytes. If we use normal SVM classification, the memory size for the input data is $4(n\times p)^2$ bytes because of the kernel matrix, while the size for the clustering data is only $4(n\times p)$ bytes. In the first-stage SVM classification, the training data size is $4(l+m)^2\times p^2$ bytes, where $l$ is the number of clusters and $m$ is the number of elements in the clusters that contain both classes.
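
To see the scale of the difference, one can simply evaluate these formulas; the numbers below are assumed for illustration and do not come from the paper:

```python
# memory estimates (bytes) per the formulas above, for assumed values:
# n = 100,000 points, p = 2 features, l = 500 clusters, m = 2,000 elements
n, p, l, m = 100_000, 2, 500, 2_000

full_svm   = 4 * (n * p) ** 2          # kernel matrix of the standard SVM
clustering = 4 * (n * p)               # one pass over the raw data
stage_one  = 4 * (l + m) ** 2 * p ** 2 # first-stage SVM training set

print(f"{full_svm:.3e}, {clustering:.3e}, {stage_one:.3e}")
# ~1.6e11 bytes for the full SVM versus ~1e8 for the first stage
```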

Experimental results

In this section we use four examples to compare our algorithms with some other SVM classification methods. In order to clarify the basic idea of our approach, let us first consider a very simple case of classification and clustering.

Example 1

We generate a set of data randomly in the range $(0,40)$. The data set has two dimensions, $X_i=(x_{i,1},x_{i,2})$. The output (label) is decided as follows:
$$y_i=\begin{cases}+1 & \text{if } W^{T}X_i+b>th,\\ -1 & \text{otherwise,}\end{cases}$$
where $W=[1.2,2.3]^T$, $b=10$ and $th=95$. In this way, the data set is linearly separable.
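
A minimal generator for this rule (our code; the sample size and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)                 # arbitrary seed
W, b, th = np.array([1.2, 2.3]), 10.0, 95.0     # parameters from Example 1

X = rng.uniform(0, 40, size=(1000, 2))          # two-dimensional points in (0, 40)
y = np.where(X @ W + b > th, 1, -1)             # the labeling rule above
```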

Example 2

In this

Conclusion and discussion

In this paper, we proposed a new classification method for large data sets that takes advantage of the minimum enclosing ball and the support vector machine (SVM). Our two-stage SVM classification has the following advantages over other SVM classifiers:

1. It can be made as fast as needed, depending on the accuracy requirement.

2. The training data size is smaller than that of some other SVM approaches, although we need two classification stages.

3. The classification accuracy does not decrease.


References

• M. Awad, L. Khan, F. Bastani, I.L. Yen, An effective support vector machine SVMs performance using hierarchical...
• M. Badoiu, S. Har-Peled, P. Indyk, Approximate clustering via core-sets, in: Proceedings of the 34th Symposium on...
• P. Burman, A comparative study of ordinary cross-validation, v-fold cross-validation and the repeated learning-testing methods, Biometrika (1989).
• J. Cervantes, X. Li, W. Yu, Support vector machine classification based on fuzzy clustering for large data sets, in:...
• C.-C. Chang, C.-J. Lin, LIBSVM: a library for support vector machines, 〈http://www.csie.ntu.edu.tw/~cjlin/libsvm〉,...
• P.-H. Chen et al., A study on SMO-type decomposition methods for support vector machines, IEEE Trans. Neural Networks (2006).
• R. Collobert et al., SVMTorch: support vector machines for large regression problems, J. Mach. Learn. Res. (2001).
• N. Cristianini et al., An Introduction to Support Vector Machines and Other Kernel-based Learning Methods (2000).
• J.-X. Dong et al., Fast SVM training algorithm with decomposition on very large data sets, IEEE Trans. Pattern Anal. Mach. Intell. (2005).
• G. Folino et al., GP ensembles for large-scale data classification, IEEE Trans. Evol. Comput. (2006).
• B.V. Gnedenko et al., Mathematical Methods of Reliability Theory (1969).
• G.B. Huang, K.Z. Mao, C.K. Siew, D.-S. Huang, Fast modular network implementation for support vector machines, IEEE...
• T. Joachims, Making large-scale support vector machine learning practical,...


Jair Cervantes received the B.S. degree in Mechanical Engineering from Orizaba Technologic Institute, Veracruz, Mexico, in 2001 and the M.S. degree in Automatic Control from CINVESTAV-IPN, México, in 2005. He is currently pursuing the Ph.D. degree in the Department of Computing, CINVESTAV-IPN. His research interests include support vector machines, pattern classification, neural networks, fuzzy logic and clustering.

Xiaoou Li received her B.S. and Ph.D. degrees in Applied Mathematics and Electrical Engineering from Northeastern University, China, in 1991 and 1995, respectively.

From 1995 to 1997, she was a lecturer of Electrical Engineering at the Department of Automatic Control of Northeastern University, China. From 1998 to 1999, she was an associate professor of Computer Science at the Centro de Instrumentos, Universidad Nacional Autónoma de México (UNAM), México. Since 2000, she has been a professor of the Departamento de Computación, Centro de Investigación y de Estudios Avanzados del Instituto Politécnico Nacional (CINVESTAV-IPN), México. From September 2006 to August 2007, she was a visiting professor at the School of Electronics, Electrical Engineering and Computer Science, Queen's University Belfast, UK.

Her research interests include Petri net theory and application, neural networks, knowledge-based systems, and data mining.

Wen Yu received the B.S. degree from Tsinghua University, Beijing, China, in 1990 and the M.S. and Ph.D. degrees, both in Electrical Engineering, from Northeastern University, Shenyang, China, in 1992 and 1995, respectively. From 1995 to 1996, he served as a Lecturer in the Department of Automatic Control at Northeastern University, Shenyang, China. In 1996, he joined CINVESTAV-IPN, México, where he is a professor in the Departamento de Control Automático. He held a research position with the Instituto Mexicano del Petróleo from December 2002 to November 2003. He was a visiting senior research fellow of Queen's University Belfast from October 2006 to December 2006. He is also a visiting professor of Northeastern University in China from 2006 to 2008. He is currently an associate editor of Neurocomputing and of the International Journal of Modelling, Identification and Control. He is a senior member of IEEE. His research interests include adaptive control, neural networks, and fuzzy control.

Kang Li is a lecturer in intelligent systems and control, Queen's University Belfast. He received the B.Sc. (Xiangtan) in 1989, M.Sc. (HIT) in 1992 and Ph.D. (Shanghai Jiaotong) in 1995. He held various research positions at Shanghai Jiaotong University (1995–1996), Delft University of Technology (1997), and Queen's University Belfast (1998–2002). His research interests cover non-linear system modelling and identification, neural networks, genetic algorithms, process control, and human supervisory control. Dr. Li is a Chartered Engineer and a member of the IEEE and the InstMC.
