ABSTRACT
Data clustering is an unsupervised learning task with applications in many scientific fields. The goal is to find subgroups of closely related data samples (clusters) in a set of unlabeled data. The classic k-Means algorithm is very popular, but it cannot handle clusters that are not linearly separable. Kernel k-Means is a state-of-the-art clustering algorithm that employs the kernel trick to perform clustering in a higher-dimensional feature space, thus overcoming the limitation of classic k-Means regarding the nonlinear separability of the input data. It has recently received a distributed implementation, named Trimmed Kernel k-Means, following the MapReduce distributed computing model. Besides performing the computations in a distributed manner, Trimmed Kernel k-Means also trims the kernel matrix in order to reduce memory requirements and improve performance. Each row of the kernel matrix is trimmed by estimating the cardinality of the cluster to which the corresponding sample belongs and removing the kernel matrix entries that connect the sample to samples probably belonging to other clusters. The Spark cluster computing framework was used for the distributed implementation. In this paper, we present a distributed clustering scheme based on Trimmed Kernel k-Means that employs subsampling in order to efficiently perform clustering on extremely large datasets. The results indicate that the proposed method runs much faster than the original Trimmed Kernel k-Means, while still providing clustering performance competitive with other state-of-the-art kernel approaches.
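To make the two ideas in the abstract concrete, the following is a minimal, single-machine sketch of kernel k-Means with per-row kernel-matrix trimming. It is not the paper's MapReduce/Spark implementation: the trimming here is a simple top-k heuristic per row (keeping the largest kernel entries as a stand-in for the cardinality-based trimming described above), and the names `rbf_kernel`, `trim_rows`, and `kernel_kmeans` are illustrative, not from the paper.

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    # Pairwise RBF kernel matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2).
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def trim_rows(K, keep):
    # Illustrative trimming: keep only the `keep` largest entries per row
    # (a sample's likely same-cluster neighbours) and zero out the rest.
    # For an RBF kernel the diagonal K[i, i] = 1 is each row's maximum,
    # so it is always retained.
    Kt = np.zeros_like(K)
    idx = np.argpartition(-K, keep - 1, axis=1)[:, :keep]
    rows = np.arange(K.shape[0])[:, None]
    Kt[rows, idx] = K[rows, idx]
    return Kt

def kernel_kmeans(K, n_clusters, n_iter=50, init_centers=None, seed=0):
    # Lloyd-style kernel k-Means. The squared feature-space distance of
    # sample i to the mean of cluster c is
    #   K[i, i] - (2 / |c|) * sum_{j in c} K[i, j]
    #           + (1 / |c|^2) * sum_{j, l in c} K[j, l].
    n = K.shape[0]
    if init_centers is not None:
        # Assign each sample to its most similar seed sample.
        labels = np.argmax(K[:, list(init_centers)], axis=1)
    else:
        labels = np.random.default_rng(seed).integers(n_clusters, size=n)
    diag = np.diag(K)
    for _ in range(n_iter):
        dist = np.full((n, n_clusters), np.inf)
        for c in range(n_clusters):
            mask = labels == c
            nc = int(mask.sum())
            if nc == 0:
                continue  # empty cluster: leave its distances at infinity
            dist[:, c] = (diag
                          - 2.0 * K[:, mask].sum(axis=1) / nc
                          + K[np.ix_(mask, mask)].sum() / nc ** 2)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels
```

Note that after trimming the matrix is no longer symmetric and the distances become approximations; the appeal is that each row now stores only `keep` entries instead of `n`, which is what makes the memory savings (and, in the distributed setting, the reduced communication) possible.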
Efficient MapReduce Kernel k-Means for Big Data Clustering