ABSTRACT
Data clustering is an unsupervised learning task with applications in many scientific fields. The goal is to find subgroups of closely related data samples (clusters) in a set of unlabeled data. The classic k-Means algorithm is very popular, but it cannot handle clusters that are not linearly separable. Kernel k-Means is a state-of-the-art clustering algorithm that employs the kernel trick to perform clustering in a higher-dimensional feature space, thus overcoming the limitation of classic k-Means regarding the nonlinear separability of the input data. It has recently received a distributed implementation, named Trimmed Kernel k-Means, following the MapReduce distributed computing model. Besides performing the computations in a distributed manner, Trimmed Kernel k-Means also trims the kernel matrix in order to reduce memory requirements and improve performance. Each row of the kernel matrix is trimmed by estimating the cardinality of the cluster to which the corresponding sample belongs and removing the kernel matrix entries that connect the sample to samples probably belonging to other clusters. The Spark cluster computing framework was used for the distributed implementation. In this paper, we present a distributed clustering scheme based on Trimmed Kernel k-Means that employs subsampling in order to efficiently perform clustering on extremely large datasets. The results indicate that the proposed method runs much faster than the original Trimmed Kernel k-Means, while still providing clustering performance competitive with other state-of-the-art kernel approaches.
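To make the two ideas in the abstract concrete, the following is a minimal, single-machine sketch of kernel k-Means with per-row kernel-matrix trimming. It is not the paper's MapReduce/Spark implementation: the trimming here is a simple top-k heuristic per row (keeping the largest kernel entries as a stand-in for the cardinality-based trimming described above), and the names `rbf_kernel`, `trim_rows`, and `kernel_kmeans` are illustrative, not from the paper.

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    # Pairwise RBF kernel matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2).
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def trim_rows(K, keep):
    # Illustrative trimming: keep only the `keep` largest entries per row
    # (a sample's likely same-cluster neighbours) and zero out the rest.
    # For an RBF kernel the diagonal K[i, i] = 1 is each row's maximum,
    # so it is always retained.
    Kt = np.zeros_like(K)
    idx = np.argpartition(-K, keep - 1, axis=1)[:, :keep]
    rows = np.arange(K.shape[0])[:, None]
    Kt[rows, idx] = K[rows, idx]
    return Kt

def kernel_kmeans(K, n_clusters, n_iter=50, init_centers=None, seed=0):
    # Lloyd-style kernel k-Means. The squared feature-space distance of
    # sample i to the mean of cluster c is
    #   K[i, i] - (2 / |c|) * sum_{j in c} K[i, j]
    #           + (1 / |c|^2) * sum_{j, l in c} K[j, l].
    n = K.shape[0]
    if init_centers is not None:
        # Assign each sample to its most similar seed sample.
        labels = np.argmax(K[:, list(init_centers)], axis=1)
    else:
        labels = np.random.default_rng(seed).integers(n_clusters, size=n)
    diag = np.diag(K)
    for _ in range(n_iter):
        dist = np.full((n, n_clusters), np.inf)
        for c in range(n_clusters):
            mask = labels == c
            nc = int(mask.sum())
            if nc == 0:
                continue  # empty cluster: leave its distances at infinity
            dist[:, c] = (diag
                          - 2.0 * K[:, mask].sum(axis=1) / nc
                          + K[np.ix_(mask, mask)].sum() / nc ** 2)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels
```

Note that after trimming the matrix is no longer symmetric and the distances become approximations; the appeal is that each row now stores only `keep` entries instead of `n`, which is what makes the memory savings (and, in the distributed setting, the reduced communication) possible.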
Efficient MapReduce Kernel k-Means for Big Data Clustering