DOI: 10.1145/3291533.3291546
research-article

The BigKClustering approach for document clustering using Hadoop MapReduce

Published: 29 November 2018

Abstract

Clustering is an effective data mining and machine learning method for gaining insight into which objects of a dataset can be grouped together. K-Means is one of the most commonly used clustering methods because of its high-quality results and low time cost. However, applying the K-Means algorithm to document clustering over large-scale collections can lead to unpredictable time costs, because the duration of a K-Means iteration tends to grow as the number of iterations grows. In this paper we first present some of the most promising alternatives for document clustering over such 'big data' (large-scale) collections. We then present our variation of an existing K-Means-based algorithm, known as BigKClustering (BKC), adjusted so that it can be applied to document clustering. The proposed adjustment of BKC is implemented using Hadoop MapReduce to handle big (text) data collections efficiently, and it is tested experimentally in a real cluster environment. The experiments show acceptable clustering quality as well as significant execution time improvements compared to K-Means, which makes the approach a promising one for clustering big document collections.
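For readers unfamiliar with how K-Means maps onto the MapReduce model, the following is a minimal in-memory sketch of one plain K-Means iteration split into a map step (assign each document to its nearest centroid) and a reduce step (average the documents assigned to each centroid). It is not the BKC algorithm and does not use the Hadoop API. The function names, the dense list-of-floats document vectors, and the choice of cosine distance are all assumptions made for illustration.

```python
from collections import defaultdict
from math import sqrt


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an all-zero vector as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def map_assign(doc, centroids):
    """Map step (hypothetical): emit (index of nearest centroid, document)."""
    nearest = min(range(len(centroids)),
                  key=lambda i: cosine_distance(doc, centroids[i]))
    return nearest, doc


def reduce_mean(vectors):
    """Reduce step (hypothetical): average the vectors of one cluster."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def kmeans_iteration(docs, centroids):
    """Simulate one map/shuffle/reduce round and return updated centroids."""
    groups = defaultdict(list)  # stands in for the shuffle phase
    for doc in docs:
        key, value = map_assign(doc, centroids)
        groups[key].append(value)
    # A centroid with no assigned documents is left unchanged.
    return [reduce_mean(groups[i]) if i in groups else list(centroids[i])
            for i in range(len(centroids))]
```

In a real Hadoop job the map and reduce functions would run on separate nodes, and every iteration would be a separate job that re-reads the collection, which is one reason the number of iterations matters so much for total run time on large collections.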

Cited By

  • (2024) A Survey and Experimental Review on Data Distribution Strategies for Parallel Spatial Clustering Algorithms. Journal of Computer Science and Technology 39:3 (610-636). DOI: 10.1007/s11390-024-2700-0. Online publication date: 1-May-2024
  • (2022) Parallel Agglomerative Hierarchical Clustering Algorithm Implementation with Hadoop MapReduce. Proceedings of 3rd International Conference on Machine Learning, Advances in Computing, Renewable Energy and Communication (115-126). DOI: 10.1007/978-981-19-2828-4_11. Online publication date: 18-Sep-2022
  • (2020) An Analytical Approach to Document Clustering Techniques. ICT Systems and Sustainability (35-42). DOI: 10.1007/978-981-15-0936-0_3. Online publication date: 29-Feb-2020

      Published In

      PCI '18: Proceedings of the 22nd Pan-Hellenic Conference on Informatics
      November 2018
      336 pages
      ISBN:9781450366106
      DOI:10.1145/3291533
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. Apache Hadoop
      2. BigKClustering
      3. K-Means
      4. MapReduce
      5. big data
      6. document clustering

      Qualifiers

      • Research-article

      Conference

      PCI '18
      PCI '18: 22nd Pan-Hellenic Conference on Informatics
      November 29 - December 1, 2018
      Athens, Greece

      Acceptance Rates

      PCI '18 Paper Acceptance Rate 57 of 105 submissions, 54%;
      Overall Acceptance Rate 190 of 390 submissions, 49%
