DOI: 10.1145/2903220.2903255

research-article

Efficient MapReduce Kernel k-Means for Big Data Clustering

Published: 18 May 2016

ABSTRACT

Data clustering is an unsupervised learning task that has found many applications in various scientific fields. The goal is to find subgroups of closely related data samples (clusters) in a set of unlabeled data. The classic k-Means algorithm is very popular, but it cannot handle cases in which the clusters are not linearly separable. Kernel k-Means is a state-of-the-art clustering algorithm that employs the kernel trick to perform clustering in a higher-dimensional feature space, thus overcoming the limitation of classic k-Means to linearly separable input data. It has recently received a distributed implementation, named Trimmed Kernel k-Means, following the MapReduce distributed computing model. Besides performing the computations in a distributed manner, Trimmed Kernel k-Means also trims the kernel matrix in order to reduce memory requirements and improve performance. Each row of the kernel matrix is trimmed by estimating the cardinality of the cluster to which the corresponding sample belongs and removing the kernel matrix entries that connect the sample to samples probably belonging to other clusters. The Spark cluster computing framework was used for the distributed implementation. In this paper, we present a distributed clustering scheme based on Trimmed Kernel k-Means that employs subsampling in order to cluster extremely large datasets efficiently. The results indicate that the proposed method runs much faster than the original Trimmed Kernel k-Means, while still providing clustering performance competitive with other state-of-the-art kernel approaches.
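The feature-space distance computation at the heart of kernel k-Means, and the kind of per-row kernel-matrix trimming the abstract describes, can be sketched as follows. This is an illustrative single-machine simplification, not the authors' implementation: the `trim_kernel_matrix` helper keeps a fixed number `p` of largest entries per row, whereas the paper's trimming is driven by an estimate of each sample's cluster cardinality, and the real system distributes the computation over Spark.

```python
import numpy as np

def kernel_kmeans(K, k, n_iter=20, seed=0):
    """Lloyd-style kernel k-Means on a precomputed n x n kernel matrix K.

    Squared feature-space distance from sample i to the centroid of
    cluster c:  K_ii - (2/|c|) * sum_{j in c} K_ij
                     + (1/|c|^2) * sum_{j,l in c} K_jl
    """
    n = K.shape[0]
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=n)          # random initial partition
    diag = np.diag(K)
    for _ in range(n_iter):
        dist = np.full((n, k), np.inf)
        for c in range(k):
            members = labels == c
            size = members.sum()
            if size == 0:                        # skip empty clusters
                continue
            dist[:, c] = (diag
                          - 2.0 * K[:, members].sum(axis=1) / size
                          + K[np.ix_(members, members)].sum() / size ** 2)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):   # converged
            break
        labels = new_labels
    return labels

def trim_kernel_matrix(K, p):
    """Keep only the p largest entries in each row, zeroing the rest.

    A simplified stand-in for the paper's cardinality-based trimming,
    which instead estimates per sample how much of the row to keep.
    """
    trimmed = np.zeros_like(K)
    top = np.argpartition(K, -p, axis=1)[:, -p:]  # indices of p largest per row
    rows = np.arange(K.shape[0])[:, None]
    trimmed[rows, top] = K[rows, top]
    return trimmed

# Toy non-linearly-separable data: two concentric rings with an RBF kernel.
rng = np.random.default_rng(1)
t = rng.uniform(0.0, 2.0 * np.pi, 200)
r = np.r_[np.ones(100), 3.0 * np.ones(100)] + 0.05 * rng.standard_normal(200)
X = np.c_[r * np.cos(t), r * np.sin(t)]
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-2.0 * sq)                            # RBF kernel matrix
labels = kernel_kmeans(trim_kernel_matrix(K, p=40), k=2)
```

Note that aggressive row-wise trimming leaves the matrix only approximately symmetric; the distance computation above still works because it only needs sums over cluster members. Whether the two rings are actually recovered depends on the kernel bandwidth and, as with plain k-Means, on the initial random partition.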

References

  1. D. Agrawal, S. Das, and A. El Abbadi. Big data and cloud computing: Current state and future opportunities. In Proceedings of the 14th International Conference on Extending Database Technology, EDBT/ICDT '11, pages 530--533, New York, NY, USA, 2011. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. A. Aizerman, E. M. Braverman, and L. I. Rozoner. Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 25:821--837, 1964.Google ScholarGoogle Scholar
  3. R. Chitta, R. Jin, T. C. Havens, and A. K. Jain. Approximate kernel k-means: solution to large scale kernel clustering. In C. Apte', J. Ghosh, and P. Smyth, editors, KDD, pages 895--903. ACM, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. J. Dean and S. Ghemawat. Mapreduce: Simplified data processing on large clusters. Commun. ACM, 51(1):107--113, Jan. 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. I. S. Dhillon, Y. Guan, and B. Kulis. Weighted graph cuts without eigenvectors a multilevel approach. IEEE Trans. Pattern Anal. Mach. Intell., 29(11):1944--1957, Nov. 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. A. Ene, S. Im, and B. Moseley. Fast clustering using mapreduce. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '11, pages 681--689, New York, NY, USA, 2011. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. M. R. Ferreira and F. de A.T. de Carvalho. Kernel-based hard clustering methods in the feature space with automatic variable weighting. Pattern Recognition, (0):--, 2014.Google ScholarGoogle Scholar
  8. R. L. Ferreira Cordeiro, C. Traina, Junior, A. J. Machado Traina, J. López, U. Kang, and C. Faloutsos. Clustering very large multi-dimensional datasets with mapreduce. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '11, pages 690--698, New York, NY, USA, 2011. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. M. Filippone, F. Camastra, F. Masulli, and S. Rovetta. A survey of kernel and spectral methods for clustering. Pattern Recognition, 41(1):176--190, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. M. Heikkila, M. Pietikainen, and C. Schmid. Description of interest regions with center-symmetric local binary patterns. In P. Kalra and S. Peleg, editors, 5th Indian Conference on Computer Vision, Graphics and Image Processing (ICVGIP '06), volume 4338 of Lecture Notes in Computer Science (LNCS), pages 58--69, Madurai, India, 2006. Springer-Verlag. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: a review. ACM Comput. Surv., 31(3):264--323, Sept. 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. H. Jia, Y. ming Cheung, and J. Liu. Cooperative and penalized competitive learning with application to kernel-based clustering. Pattern Recognition, (0):--, 2014.Google ScholarGoogle Scholar
  13. H. Karloff, S. Suri, and S. Vassilvitskii. A model of computation for mapreduce. In Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '10, pages 938--948, Philadelphia, PA, USA, 2010. Society for Industrial and Applied Mathematics. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. D.-W. Kim, K. Y. Lee, D. Lee, and K. H. Lee. Evaluation of the performance of clustering algorithms in kernel-induced feature space. Pattern Recognition, 38(4):607--611, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. T. O. Kvålseth. Entropy and correlation: Some comments. IEEE Transactions on Systems, Man, and Cybernetics, 17(3):517--519, 1987.Google ScholarGoogle ScholarCross RefCross Ref
  16. T. Ojala, M. Pietikäinen, and D. Harwood. A comparative study of texture measures with classification based on featured distributions. Pattern Recognition, 29(1):51--59, Jan. 1996.Google ScholarGoogle ScholarCross RefCross Ref
  17. L. M. Rodrigues, L. E. Zárate, C. N. Nobre, and H. C. Freitas. Parallel and distributed kmeans to identify the translation initiation site of proteins. In SMC, pages 1639--1645. IEEE, 2012.Google ScholarGoogle ScholarCross RefCross Ref
  18. B. Schölkopf, A. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Comput., 10(5):1299--1319, July 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. N. Tsapanos, A. Tefas, N. Nikolaidis, and I. Pitas. A distributed framework for trimmed kernel k-means clustering. Pattern Recognition, 48(8):2685--2698, 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. L. Wolf, T. Hassner, and I. Maoz. Face recognition in unconstrained videos with matched background similarity. In in Proc. IEEE Conf. Comput. Vision Pattern Recognition, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. L. Wolf, T. Hassner, and Y. Taigman. Descriptor based methods in the wild. In Real-Life Images workshop at the European Conference on Computer Vision (ECCV), October 2008.Google ScholarGoogle Scholar
  22. S. Yu, L.-C. Tranchevent, X. Liu, W. Glanzel, J. A. Suykens, B. D. Moor, and Y. Moreau. Optimized data fusion for kernel k-means clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(5):1031--1039, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. F. Zhou, F. De la Torre Frade, and J. K. Hodgins. Hierarchical aligned cluster analysis for temporal clustering of human motion. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 35(3):582--596, March 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  1. Efficient MapReduce Kernel k-Means for Big Data Clustering

Published in

SETN '16: Proceedings of the 9th Hellenic Conference on Artificial Intelligence
May 2016, 249 pages
Copyright © 2016 ACM

Publisher: Association for Computing Machinery, New York, NY, United States
