Neurocomputing

Volume 71, Issues 4–6, January 2008, Pages 611–619

Support vector machine classification for large data sets via minimum enclosing ball clustering

https://doi.org/10.1016/j.neucom.2007.07.028

Abstract

Support vector machine (SVM) is a powerful technique for data classification. Despite its sound theoretical foundations and high classification accuracy, the standard SVM is not suitable for classifying large data sets, because its training complexity depends heavily on the size of the data set. This paper presents a novel SVM classification approach for large data sets that uses minimum enclosing ball clustering. After the training data are partitioned by the proposed clustering method, the cluster centers are used for a first-stage SVM classification. The clusters whose centers are support vectors, together with the clusters that contain both classes, are then used for a second-stage SVM classification; at this stage most of the data have been removed. Several experimental results show that the proposed approach achieves classification accuracy close to that of the classic SVM, while training is significantly faster than with several other SVM classifiers.

Introduction

There are a number of standard classification techniques in the literature, such as simple rule-based and nearest-neighbor classifiers, Bayesian classifiers, artificial neural networks, decision trees, the support vector machine (SVM), ensemble methods, etc. Among these techniques, SVM is one of the best known for the optimality of its solution [10], [20], [29]. Recently, many new SVM classifiers have been reported. A geometric approach to SVM classification was given in [21], and a fuzzy neural network SVM classifier was studied in [19]. Despite its sound theoretical foundations and good generalization performance, SVM is not suitable for the classification of large data sets, since it must solve a quadratic programming (QP) problem in order to find a separating hyperplane, which entails intensive computational complexity.

Many researchers have tried to find ways to apply SVM classification to large data sets. Generally, these methods fall into two types: (1) modify the SVM algorithm so that it can be applied to large data sets, and (2) select representative training data from the large data set so that a standard SVM can handle them.

For the first type, a standard projected conjugate gradient (PCG) chunking algorithm scales somewhere between linearly and cubically in the training set size [9], [16]. Sequential minimal optimization (SMO) is a fast method to train SVM [24], [8]. Training an SVM requires solving a QP optimization problem; SMO breaks this large QP problem into a series of smallest possible QP problems, each solved analytically, and it is faster than PCG chunking. Dong et al. [11] introduced a parallel optimization step in which block diagonal matrices are used to approximate the original kernel matrix, so that SVM classification can be split into hundreds of subproblems. A recursive and computationally superior mechanism referred to as adaptive recursive partitioning was proposed in [17], where the data are recursively subdivided into smaller subsets. Genetic programming is able to deal with large data sets that do not fit in main memory [12]. Neural network techniques can also be applied to SVM to simplify the training process [15].
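
Since SMO is also the solver used later in this paper, a compact sketch helps fix ideas. The following implements the simplified SMO variant for a linear kernel; it is a didactic reconstruction, and the function name, parameters, and random second-index heuristic are ours, not Platt's full working-set heuristics nor the authors' implementation:

```python
import numpy as np

def simplified_smo(X, y, C=1.0, tol=1e-3, max_passes=5):
    """Didactic SMO for a linear kernel: optimize one pair of alphas
    analytically at a time instead of solving the full QP at once."""
    n = X.shape[0]
    alpha, b = np.zeros(n), 0.0
    K = X @ X.T                                   # linear kernel matrix
    passes = 0
    while passes < max_passes:
        changed = 0
        for i in range(n):
            E_i = (alpha * y) @ K[:, i] + b - y[i]
            if (y[i] * E_i < -tol and alpha[i] < C) or (y[i] * E_i > tol and alpha[i] > 0):
                j = i
                while j == i:                     # pick a second index at random
                    j = np.random.randint(n)
                E_j = (alpha * y) @ K[:, j] + b - y[j]
                a_i, a_j = alpha[i], alpha[j]
                if y[i] != y[j]:                  # box constraints for the pair
                    L, H = max(0, a_j - a_i), min(C, C + a_j - a_i)
                else:
                    L, H = max(0, a_i + a_j - C), min(C, a_i + a_j)
                eta = 2 * K[i, j] - K[i, i] - K[j, j]
                if L == H or eta >= 0:
                    continue
                alpha[j] = np.clip(a_j - y[j] * (E_i - E_j) / eta, L, H)
                if abs(alpha[j] - a_j) < 1e-5:
                    continue
                alpha[i] = a_i + y[i] * y[j] * (a_j - alpha[j])
                b1 = b - E_i - y[i] * (alpha[i] - a_i) * K[i, i] - y[j] * (alpha[j] - a_j) * K[i, j]
                b2 = b - E_j - y[i] * (alpha[i] - a_i) * K[i, j] - y[j] * (alpha[j] - a_j) * K[j, j]
                b = b1 if 0 < alpha[i] < C else (b2 if 0 < alpha[j] < C else (b1 + b2) / 2)
                changed += 1
        passes = passes + 1 if changed == 0 else 0
    w = (alpha * y) @ X                           # recover the primal weights
    return w, b, alpha
```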

For the second type, clustering has proved to be an effective way to collaborate with SVM in classifying large data sets, for example, hierarchical clustering [31], [1], k-means clustering [5] and parallel clustering [8]. Clustering-based methods can reduce the computational burden of SVM; however, the clustering algorithms themselves are still complicated for large data sets. Rocchio bundling is a statistics-based data reduction method [26]. The Bayesian committee machine has also been reported to train SVM on large data sets: the large data set is divided into m subsets of the same size, and m models are derived from the individual sets [27]. However, it has a higher error rate than the standard SVM, and the sparse property does not hold.

In this paper, a new approach to reducing the training data set is proposed, based on minimum enclosing ball (MEB) clustering. The MEB is the smallest ball that contains all the points in a given set. Our method uses the core-set idea [18], [3] to partition the input data set into several balls, which we call k-balls clustering. In normal clustering the number of clusters may have to be predefined, since determining the optimal number of clusters may involve more computational cost than the clustering itself. The method of this paper does not need the optimal number of clusters; we only need to partition the training data set and extract support vectors with SMO. We then remove the balls whose centers are not support vectors. For the remaining balls, we apply a de-clustering technique and classify their elements with SMO again to obtain the final support vectors. The experimental results show that the accuracy obtained by our approach is very close to that of classic SVM methods, while the training time is significantly shorter. The proposed approach can therefore classify huge data sets with high accuracy.
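
To make the overall flow concrete, the sketch below wires the stages together, with scikit-learn's linear-kernel SVC standing in for SMO and `meb_partition` as a placeholder for the MEB clustering described in the next section; all names here are ours, not the paper's:

```python
import numpy as np
from sklearn.svm import SVC

def two_stage_svm(X, y, meb_partition):
    """Sketch of the two-stage scheme. `meb_partition` must return a
    list of index arrays, one per ball, covering all rows of X."""
    balls = meb_partition(X)
    # Stage 1: one labeled representative per ball (the mean stands in
    # for the MEB center here), classified with a linear SVM.
    centers = np.array([X[b].mean(axis=0) for b in balls])
    labels = np.array([1 if y[b].mean() >= 0 else -1 for b in balls])
    svm1 = SVC(kernel="linear").fit(centers, labels)
    # Keep balls whose centers are support vectors, plus balls that
    # mix both classes, then de-cluster them back to raw points.
    keep = set(svm1.support_) | {i for i, b in enumerate(balls)
                                 if len(np.unique(y[b])) > 1}
    idx = np.concatenate([balls[i] for i in sorted(keep)])
    # Stage 2: final SVM on the surviving (much smaller) data set.
    return SVC(kernel="linear").fit(X[idx], y[idx])
```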


MEB clustering algorithm

The MEB clustering proposed in this paper uses the concept of core-sets, defined as follows.

Definition 1

The ball with center $c$ and radius $r$ is denoted by $B(c,r)$.

Definition 2

Given a set of points $S=\{x_1,\ldots,x_m\}$ with $x_i\in\mathbb{R}^d$, the MEB of $S$ is the smallest ball that contains all the balls and all the points in $S$; it is denoted by $\mathrm{MEB}(S)$.

Because it is very difficult to find the optimal ball $\mathrm{MEB}(S)$, we use an approximation, defined as follows.

Definition 3

A $(1+\varepsilon)$-approximation of $\mathrm{MEB}(S)$, with $\varepsilon>0$, is a ball $B(c,(1+\varepsilon)r)$ such that $r\le r_{\mathrm{MEB}(S)}$ and $S\subset B(c,(1+\varepsilon)r)$.
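
For concreteness, here is a minimal sketch of the classic core-set iteration that yields such a $(1+\varepsilon)$-approximation [3]: start from an arbitrary point and repeatedly shift the center toward the current farthest point. The function name and the NumPy formulation are ours:

```python
import numpy as np

def approx_meb(S, eps=0.1):
    """(1+eps)-approximate minimum enclosing ball via the core-set
    iteration: O(1/eps^2) passes over the points."""
    c = S[0].astype(float).copy()            # start at an arbitrary point
    for i in range(1, int(np.ceil(1.0 / eps**2)) + 1):
        far = S[np.argmax(np.linalg.norm(S - c, axis=1))]  # farthest point
        c += (far - c) / (i + 1)             # step the center toward it
    r = np.linalg.norm(S - c, axis=1).max()  # enclosing radius at this center
    return c, r

# usage: points drawn in the unit square
pts = np.random.rand(1000, 2)
center, radius = approx_meb(pts, eps=0.05)
```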

SVM classification via MEB clustering

Let $(X,Y)$ be the training pattern set, $X=\{x_1,\ldots,x_n\}$, $Y=\{y_1,\ldots,y_n\}$, $y_i=\pm 1$, $x_i=(x_{i1},\ldots,x_{ip})^T\in\mathbb{R}^p$. The training task of SVM classification is to find, from the input $X$ and the output $Y$, the optimal hyperplane that maximizes the margin between the classes. By the sparse property of SVM, the data that are not support vectors do not contribute to the optimal hyperplane. The input data that are far away from the decision hyperplane should be eliminated, while the data that are possibly support vectors should be kept.
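
This elimination is justified by a property that is easy to verify numerically: retraining on the support vectors alone leaves the hyperplane unchanged. A minimal check, assuming scikit-learn's SVC (our illustration, not the paper's code):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=500, centers=2, random_state=0)
y = np.where(y == 0, -1, 1)

full = SVC(kernel="linear", C=10.0).fit(X, y)
sv = full.support_                                   # indices of the support vectors
reduced = SVC(kernel="linear", C=10.0).fit(X[sv], y[sv])

# non-support vectors are inactive constraints, so both fits give
# the same hyperplane (up to numerical tolerance)
print(full.coef_, full.intercept_)
print(reduced.coef_, reduced.intercept_)
```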

Memory space

In the first clustering step, the total input data set $X=\{x_1,\ldots,x_n\}$, $Y=\{y_1,\ldots,y_n\}$, $y_i=\pm 1$, $x_i=(x_{i1},\ldots,x_{ip})^T\in\mathbb{R}^p$ is loaded into memory. The data type is float, so each datum occupies 4 bytes. If we use normal SVM classification, the memory size for the input data is $4(n\times p)^2$ bytes because of the kernel matrix, while the size for the clustering data is only $4(n\times p)$ bytes. In the first-stage SVM classification, the training data size is $4(l+m)^2\times p^2$ bytes, where $l$ is the number of clusters and $m$ is the number of elements in the clusters that contain both classes.
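
To see the scale of the difference, one can simply evaluate these formulas; the numbers below are assumed for illustration and do not come from the paper:

```python
# memory estimates (bytes) per the formulas above, for assumed values:
# n = 100,000 points, p = 2 features, l = 500 clusters, m = 2,000 elements
n, p, l, m = 100_000, 2, 500, 2_000

full_svm   = 4 * (n * p) ** 2          # kernel matrix of the standard SVM
clustering = 4 * (n * p)               # one pass over the raw data
stage_one  = 4 * (l + m) ** 2 * p ** 2 # first-stage SVM training set

print(f"{full_svm:.3e}, {clustering:.3e}, {stage_one:.3e}")
# ~1.6e11 bytes for the full SVM versus ~1e8 for the first stage
```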

Experimental results

In this section we use four examples to compare our algorithms with some other SVM classification methods. In order to clarify the basic idea of our approach, let us first consider a very simple case of classification and clustering.

Example 1

We generate a set of data randomly in the range $(0,40)$. The data set has two dimensions, $X_i=(x_{i,1},x_{i,2})$. The output (label) is decided as follows:
$$y_i=\begin{cases}+1 & \text{if } W^{T}X_i+b>th,\\ -1 & \text{otherwise,}\end{cases}$$
where $W=[1.2,2.3]^T$, $b=10$ and $th=95$. In this way, the data set is linearly separable.
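
A minimal generator for this rule (our code; the sample size and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)                 # arbitrary seed
W, b, th = np.array([1.2, 2.3]), 10.0, 95.0     # parameters from Example 1

X = rng.uniform(0, 40, size=(1000, 2))          # two-dimensional points in (0, 40)
y = np.where(X @ W + b > th, 1, -1)             # the labeling rule above
```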

Example 2

In this

Conclusion and discussion

In this paper, we proposed a new classification method for large data sets that takes advantage of the minimum enclosing ball and the support vector machine (SVM). Our two-stage SVM classification has the following advantages over other SVM classifiers:

1. It can be made as fast as needed, depending on the accuracy requirement.

2. The training data size is smaller than that of some other SVM approaches, although we need two classification stages.

3. The classification accuracy does not decrease.


References

• M. Awad, L. Khan, F. Bastani, I.L. Yen, An effective support vector machine SVMs performance using hierarchical...
• M. Badoiu, S. Har-Peled, P. Indyk, Approximate clustering via core-sets, in: Proceedings of the 34th Symposium on...
• P. Burman, A comparative study of ordinary cross-validation, v-fold cross-validation and the repeated learning-testing methods, Biometrika (1989).
• J. Cervantes, X. Li, W. Yu, Support vector machine classification based on fuzzy clustering for large data sets, in:...
• C.-C. Chang, C.-J. Lin, LIBSVM: a library for support vector machines, 〈http://www.csie.ntu.edu.tw/~cjlin/libsvm〉,...
• P.-H. Chen et al., A study on SMO-type decomposition methods for support vector machines, IEEE Trans. Neural Networks (2006).
• R. Collobert et al., SVMTorch: support vector machines for large regression problems, J. Mach. Learn. Res. (2001).
• N. Cristianini et al., An Introduction to Support Vector Machines and Other Kernel-based Learning Methods (2000).
• J.-X. Dong et al., Fast SVM training algorithm with decomposition on very large data sets, IEEE Trans. Pattern Anal. Mach. Intell. (2005).
• G. Folino et al., GP ensembles for large-scale data classification, IEEE Trans. Evol. Comput. (2006).
• B.V. Gnedenko et al., Mathematical Methods of Reliability Theory (1969).
• G.B. Huang, K.Z. Mao, C.K. Siew, D.-S. Huang, Fast modular network implementation for support vector machines, IEEE...
• T. Joachims, Making large-scale support vector machine learning practical,...


Jair Cervantes received the B.S. degree in Mechanical Engineering from Orizaba Technologic Institute, Veracruz, Mexico, in 2001 and the M.S. degree in Automatic Control from CINVESTAV-IPN, México, in 2005. He is currently pursuing the Ph.D. degree in the Department of Computing, CINVESTAV-IPN. His research interests include support vector machines, pattern classification, neural networks, fuzzy logic and clustering.

Xiaoou Li received her B.S. and Ph.D. degrees in Applied Mathematics and Electrical Engineering from Northeastern University, China, in 1991 and 1995, respectively.

From 1995 to 1997, she was a lecturer of Electrical Engineering at the Department of Automatic Control of Northeastern University, China. From 1998 to 1999, she was an associate professor of Computer Science at the Centro de Instrumentos, Universidad Nacional Autónoma de México (UNAM), México. Since 2000, she has been a professor of the Departamento de Computación, Centro de Investigación y de Estudios Avanzados del Instituto Politécnico Nacional (CINVESTAV-IPN), México. From September 2006 to August 2007, she was a visiting professor at the School of Electronics, Electrical Engineering and Computer Science, Queen's University Belfast, UK.

Her research interests include Petri net theory and application, neural networks, knowledge-based systems, and data mining.

Wen Yu received the B.S. degree from Tsinghua University, Beijing, China, in 1990 and the M.S. and Ph.D. degrees, both in Electrical Engineering, from Northeastern University, Shenyang, China, in 1992 and 1995, respectively. From 1995 to 1996, he served as a Lecturer in the Department of Automatic Control at Northeastern University, Shenyang, China. In 1996, he joined CINVESTAV-IPN, México, where he is a professor in the Departamento de Control Automático. He held a research position with the Instituto Mexicano del Petróleo from December 2002 to November 2003. He was a visiting senior research fellow of Queen's University Belfast from October 2006 to December 2006. He is also a visiting professor of Northeastern University in China from 2006 to 2008. He is currently an associate editor of Neurocomputing and of the International Journal of Modelling, Identification and Control. He is a senior member of IEEE. His research interests include adaptive control, neural networks, and fuzzy control.

Kang Li is a lecturer in intelligent systems and control, Queen's University Belfast. He received the B.Sc. (Xiangtan) in 1989, M.Sc. (HIT) in 1992 and Ph.D. (Shanghai Jiaotong) in 1995. He held various research positions at Shanghai Jiaotong University (1995–1996), Delft University of Technology (1997), and Queen's University Belfast (1998–2002). His research interests cover non-linear system modelling and identification, neural networks, genetic algorithms, process control, and human supervisory control. Dr. Li is a Chartered Engineer and a member of the IEEE and the InstMC.
