Fast and scalable support vector clustering for large-scale data analysis

Ping, Yuan; Chang, Yun Feng; Zhou, Yajian; Tian, Ying Jie; Yang, Yi Xian; Zhang, Zhili

doi:10.1007/s10115-013-0724-9

Fast and scalable support vector clustering for large-scale data analysis

Regular Paper
Published: 17 January 2014

Volume 43, pages 281–310, (2015)
Cite this article

Knowledge and Information Systems Aims and scope Submit manuscript

Yuan Ping^1,2,
Yun Feng Chang³,
Yajian Zhou²,
Ying Jie Tian⁴,
Yi Xian Yang^2,5 &
…
Zhili Zhang¹

657 Accesses
14 Citations
Explore all metrics

Abstract

As an important boundary-based clustering algorithm, support vector clustering (SVC) benefits multiple applications for its capability of handling arbitrary cluster shapes. However, its popularity is degraded by both its highly intensive pricey computation and poor label performance which are due to redundant kernel function matrix required by estimating a support function and ineffectively checking segmers between all pairs of data points, respectively. To address these two problems, a fast and scalable SVC (FSSVC) method is proposed in this paper to achieve significant improvement on efficiency while guarantees a comparable accuracy with the state-of-the-art methods. The heart of our approach includes (1) constructing the hypersphere and support function by cluster boundaries which prunes unnecessary computation and storage of kernel functions and (2) presenting an adaptive labeling strategy which decomposes clusters into convex hulls and then employs a convex-decomposition-based cluster labeling algorithm or cone cluster labeling algorithm on the basis of whether the radius of the hypersphere is greater than 1. Both theoretical analysis and experimental results (e.g., the first rank of a nonparametric statistical test) show the superiority of our method over the others, especially for large-scale data analysis under limited memory requirements.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

An Efficient Cluster Assignment Algorithm for Scaling Support Vector Clustering

CBCH (clustering-based convex hull) for reducing training time of support vector machine

Article 06 March 2019

Mixed-integer quadratic optimization and iterative clustering techniques for semi-supervised support vector machines

Article Open access 16 May 2024

References

Alzate C, Suykens JAK (2010) Multiway spectral clustering with out-of-sample extensions through weighted kernel pca. IEEE Trans Patt Anal Mach Intell 32(2):335–347
Article Google Scholar
Alzate C, Suykens JAK (2011) Sparse kernel spectral clustering models for large-scale data analysis. Neurocomputing 74(9):1382–1390
Article Google Scholar
Ban T, Abe S (2004) Spatially chunking support vector clustering algorithm. In: Proceedings of international joint conference on neural networks, pp 414–418
Ben-Hur A, Horn D, Siegelmann HT, Vapnik VN (2000) A support vector cluster method. In: Proceedings of 15th international conference on pattern recognition, vol 2, Barcelona, Spain, pp 724–727
Ben-Hur A, Horn D, Siegelmann HT, Vapnik VN (2001) Support vector clustering. J Mach Learn Res 2:125–137
Google Scholar
Bergmann G, Hommel G (1988) Improvements of general multiple test procedures for redundant systems of hypotheses. In: Multiple Hypotheses Testing, vol 70, pp. 100–115
Boyd S, Vandenberghe L (2009) Convex Optimization. cambridge university press, Cambridge
Google Scholar
Burges C (1998) A tutorial on support vector machines for pattern recognition. Data Min Know Dis 2(2):121–167
Article Google Scholar
Camastra F, Verri A (2005) A novel kernel method for clustering. IEEE Trans Patt Anal Mach Intell 27(5):801–805
Article Google Scholar
Chiang JH, Hao PY (2003) A new kernel-based fuzzy clustering approach: Support vector clustering with cell growing. IEEE Trans Fuzzy Syst 11(4):518–527
Article Google Scholar
Estivill-Castro V, Lee I (2000) Amoeba: Hierarchical clustering based on spatial proximity using delaunay diagram. In: Proceedings of the 9th international symposium on spatial data handling, pp 7a.26-7a.4
Estivill-Castro V, Lee I, Murray AT (2001) Criteria on proximity graphs for boundary extraction and spatial clustering. In: Proceedings of the 5th Pacific-Asia conference on knowledge discovery and data mining (PAKDD’01), vol 2035, pp 348–357
Forghani Y, Yazdi HS, Effati S (2012) An extension to fuzzy support vector data description (fsvdd*). Patt Anal Appl 15(3):237–247
Article MathSciNet Google Scholar
Frank A, Asuncion A (2010) UCI machine learning repository. http://archive.ics.uci.edu/ml
Garcia S, Herrera F (2008) An extension on statistical comparisons of classifiers over multiple data sets for all pairwise comparisons. J Mach Learn Res 9:2677–2694
MATH Google Scholar
Graven M, DiPasquo D, Freitag D, McCallum A, Mitchell T, Nigam K, Slattery S (1998) Learning to extract symbolic knowledge form the world wide web. In: Proceedings of 15th Nat’l conference for artificial intelligence (AAAI’98), pp 509–516. Madison, Wisconsin
Guo CH, Li F (2011) An improved algorithm for support vector clustering based on maximum entropy principle and kernel matrix. Exp Syst Appl 38(7):8138–8143
Article Google Scholar
Hersh WR, Buckley C, Leone TJ, Hickam DH (1994) Ohsumed: An interactive retrieval evaluation and new large test collection for research. In: Proceedings of 17th annual ACM SIGIR conference, pp 192–201
Hubert PAL (1985) Comparing partitions. J Classif 2:193–218
Article Google Scholar
Hurley J, Garcia-Palacios E, Sezer S (2011) Classifying network protocols: a ‘two-way’ flow approach. IET Commun 5(1):79–89
Article Google Scholar
Jung KH, Kim N, Lee J (2011) Dynamic pattern denoising method using multi-basin system with kernels. Patt Recogn 44(8):1698–1707
Article MATH Google Scholar
Jung KH, Lee D, Lee J (2010) Fast support-based clustering method for large-scale problems. Patt Recogn 43(5):1975–1983
Article MATH Google Scholar
Kim HC, Lee J (2007) Clustering based on gaussian processes. Neural Comput 19(11):3088–3107
Article MATH MathSciNet Google Scholar
Lang K (1995) Newsweeder: Learning to filter netnews. In: Proceedings 12th international conference machine learning (ICML’95), pp 331–339. Tahoe City.
Lee CH, Yang HC (2009) Construction of supervised and unsupervised learning systems for multilingual text categorization. Exp Syst Appl 2–1(36):2400–2410
Article Google Scholar
Lee D, Jung KH, Lee J (2009) Constructing sparse kernel machines using attractors. IEEE Trans Pat Anal Mach Intell 20(4):721–729
Google Scholar
Lee D, Lee J (2007) Equilibrium-based support vector machine for semisupervised classification. IEEE Trans Neural Netw 18(2):578–583
Article Google Scholar
Lee J, Lee D (2005) An improved cluster labeling method for support vector clustering. IEEE Trans Patt Anal Mach Intell 27(3):461–464
Article Google Scholar
Lee J, Lee D (2006) Dynamic characterization of cluster structures for robust and inductive support vector clustering. IEEE Trans Patt Anal Mach Intell 28(11):1869–1874
Article Google Scholar
Lee SH, Daniels KM (2005) Gaussian kernel width selection and fast cluster labeling for support vector clustering. Technical Report No. 2005–009, USA
Lee SH, Daniels KM (2005) Cone cluster labeling for support vector clustering. In: Proceedings of 6th SIAM conference on data mining, pp 484–488
Lewis DD (1997) Reuters-21578 text categorization collection. http://kdd.ics.uci.edu/databases/reuters21578/
Li YH (2011) Selecting training points for one-class support vector machines. Patt Recogn Lett 32(11):1517–1522
Article Google Scholar
Li YH, Maguire L (2011) Selecting critical patterns based on local geometrical and statistical information. IEEE Trans Patt Anal Mach Intell 33(6):1189–1201
Article Google Scholar
Ling P, Zhou CG, Zhou X (2010) Improved support vector clustering. Eng Appl Artif Intell 23(4):552–559
Article Google Scholar
Mehrotra S (1992) On the implementation of a primal-dual interior point method. SIAM J Optim 2: 575–601
Google Scholar
Peng JF, Zhou YJ, Wang C, Yang YX, Ping Y (2011) Early TCP traffic classification. J Appl Sci Electron Inform Eng 29(1):73–77
Google Scholar
Ping Y, Tian YJ, Zhou YJ, Yang YX (2012) Convex decomposition based cluster labeling method for support vector clustering. J Comput Sci Technol 27(2):428–442
Article MATH MathSciNet Google Scholar
Ping Y, Zhou YJ, Xue C, Yang YX (2012) Efficient representation of text with multiple perspectives. J China Uni Posts Telecommun 19(1):101–111
Google Scholar
Ping Y, Zhou YJ, Yang YX (2011) A novel scheme for accelerating support vector clustering (in press)
Platt JC (1999) Fast training of support vector machines using sequential minimal optimization. In: Advances in kernel methods: support vector learning. MIT Press, Cambridge
Puma-Villanueva WJ, Bezerra GB, Lima CAM, Zuben FJV (2005) Improving support vector clustering with ensembles. In: Proceedings of international joint conference on neural networks, Montreal, pp 13–15
Scholkopf B, Platt JC, Shawe-Taylor J, Smola AJ, Williamson RC (2001) Estimating the support of a high-dimensional distribution. Neural Comput 13(7):1443–1472
Article Google Scholar
Shamir O, Tishby N (2010) Stability and model selection in k-means clustering. Mach Learn 81(1): 213–243
Google Scholar
Sheskin D (2003) Handbook of parametric and nonparametric statistical procedures. Chapman & Hall, London
Book Google Scholar
Su MY (2011) Using clustering to improve the KNN-based classifiers for online anomaly network traffic identification. J Netw Comput Appl 34(2):722–730
Article Google Scholar
Tax DMJ, Duin PRW (1999) Support vector domain description. Patt Recogn Lett 11–13(20):1191–1199
Article Google Scholar
UNIBS: The UNIBS anonymized 2009 internet traces (Mar 18, 2010)
Veenman CJ, Reinders MJT, Backer E (2002) A maximum variance cluster algorithm. IEEE Trans Patt Anal Mach Intell 24(9):1273–1280
Article Google Scholar
Wang CD, Lai JH (2013) Position regularized support vector domain description. Patt Recogn 46(3): 875–884
Google Scholar
Wang CD, Lai JH, Huang D, Zheng WS (2012) Svstream: A support vector based algorithm for clustering data streams. IEEE Trans Know Data Eng (in press), 1–14 . doi: 10.1109/TKDE.2011.263
Wang DF, Shi L, Yeung DS, Tsang ECC, Heng PA (2007) Ellipsoidal support vector clustering for functional MRI analysis. Patt Recogn 40(10):2685–2695
Article MATH Google Scholar
Wang JS, Chiang JC (2009) An efficient data preprocessing procedure for support vector clustering. J Univ Comput Sci 15(4):705–721
Google Scholar
Wu M, Scholkopf B (2007) A local learning approach for clustering. In: Advances in neural information processing systems (NIPS 2007), vol 19, pp 1529–1536. Vancouver, Canada (2007)
Xu R, Wunsch DC (2008) Clustering. Wiley, Hoboken
Book Google Scholar
Yang JH, Estivill-Castro V, Chalup SK (2002) Support vector clustering through proximity graph modelling. In: Proceedings of 9th international conference on neural information processing (ICONIP’02), pp 898–903. Orchid Country Club, Singapore

Download references

Acknowledgments

Useful suggestions from the referees are gratefully acknowledged. This work was supported by the grant from the National High Technology Research and Development Program of China under Grant No. 2009AA01Z430, the grant from the National Natural Science Foundation of China under Grant No. 61303232, 61121061, 60821001, 11271361 and 70921061, the Specialized Research Fund for the Doctoral Program of Higher Education under Grant No. 20120005110017, the Ministry of water resources’ special funds for scientific research on pubic causes under Grant No. 201301094, the Foundation of He’nan Educational Committee under Grant No.13A413750, 13A413747, the Natural Science Foundation of He’nan Province of China under Grant No. 122102210543 and Xuchang Municipal Natural Science Foundation under Grant No. 5018.

Author information

Authors and Affiliations

School of Information Engineering, Xuchang University, Xuchang, 461000, China
Yuan Ping & Zhili Zhang
Information Security Center, Beijing University of Posts and Telecommunications, Beijing, 100876, China
Yuan Ping, Yajian Zhou & Yi Xian Yang
College of Science, China Three Gorges University YiChang, Hubei , 443002, China
Yun Feng Chang
Graduate University of Chinese Academy of Sciences, Beijing, 100190, China
Ying Jie Tian
School of Information Engineering, Beijing Institute of Graphic Communication, Beijing, 102600, China
Yi Xian Yang

Authors

Yuan Ping
View author publications
You can also search for this author inPubMed Google Scholar
Yun Feng Chang
View author publications
You can also search for this author inPubMed Google Scholar
Yajian Zhou
View author publications
You can also search for this author inPubMed Google Scholar
Ying Jie Tian
View author publications
You can also search for this author inPubMed Google Scholar
Yi Xian Yang
View author publications
You can also search for this author inPubMed Google Scholar
Zhili Zhang
View author publications
You can also search for this author inPubMed Google Scholar

Corresponding author

Correspondence to Yuan Ping.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Ping, Y., Chang, Y.F., Zhou, Y. et al. Fast and scalable support vector clustering for large-scale data analysis. Knowl Inf Syst 43, 281–310 (2015). https://doi.org/10.1007/s10115-013-0724-9

Download citation

Received: 08 February 2013
Revised: 16 November 2013
Accepted: 04 December 2013
Published: 17 January 2014
Issue Date: May 2015
DOI: https://doi.org/10.1007/s10115-013-0724-9

Keywords

Access this article

Log in via an institution

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Fast and scalable support vector clustering for large-scale data analysis

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

An Efficient Cluster Assignment Algorithm for Scaling Support Vector Clustering

CBCH (clustering-based convex hull) for reducing training time of support vector machine

Mixed-integer quadratic optimization and iterative clustering techniques for semi-supervised support vector machines

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Subscribe and save

Buy Now