research-article

The BigKClustering approach for document clustering using Hadoop MapReduce

Authors:

Sofia Megarchioti,

Basilis Mamalis

PCI '18: Proceedings of the 22nd Pan-Hellenic Conference on Informatics

Pages 261 - 266

https://doi.org/10.1145/3291533.3291546

Published: 29 November 2018

Abstract

Clustering is an efficient data mining and machine-learning method for gaining insight into which objects of a dataset can be grouped together. K-Means is one of the most commonly used clustering methods, owing to its high-quality results and low time cost. However, using the K-Means algorithm for document clustering over large-scale collections can lead to unpredictable time costs, since the duration of a K-Means iteration tends to grow as the number of iterations grows. In this paper we first present some of the most promising alternatives for document clustering over such 'big data' (large-scale) collections. We then present our variation of an existing K-Means-based algorithm, known as BigKClustering (BKC), adapted so that it can be applied to document clustering. The proposed adjustment of BKC is implemented using Hadoop MapReduce to handle big (text) data collections efficiently, and is experimentally tested in a real cluster environment. The experiments show that it achieves acceptable clustering quality as well as significant execution time improvements compared to K-Means, making it a promising clustering approach for big document collections.
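
As a rough illustration of the baseline that the abstract compares against, the following minimal sketch expresses one plain K-Means iteration over document vectors as a map step and a reduce step. It is not the BKC algorithm and not the authors' Hadoop implementation: the function names (`cosine`, `map_phase`, `reduce_phase`, `kmeans_iteration`), the sparse `{term: weight}` document representation, and the use of cosine similarity are all assumptions made for this example, and everything runs in a single process rather than on a cluster.

```python
from collections import defaultdict
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = sqrt(sum(w * w for w in a.values()))
    norm_b = sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def map_phase(docs, centroids):
    """Map: emit (index of the most similar centroid, document vector) per document."""
    for doc in docs:
        best = max(range(len(centroids)), key=lambda i: cosine(doc, centroids[i]))
        yield best, doc


def reduce_phase(pairs, centroids):
    """Reduce: average the vectors assigned to each centroid to form the new centroid."""
    groups = defaultdict(list)
    for key, doc in pairs:
        groups[key].append(doc)
    new_centroids = list(centroids)  # a centroid with no documents is kept unchanged
    for key, members in groups.items():
        total = defaultdict(float)
        for doc in members:
            for term, weight in doc.items():
                total[term] += weight
        new_centroids[key] = {t: w / len(members) for t, w in total.items()}
    return new_centroids


def kmeans_iteration(docs, centroids):
    """One K-Means iteration: assign documents (map), then recompute centroids (reduce)."""
    return reduce_phase(map_phase(docs, centroids), centroids)
```

In a real MapReduce job the map step would run in parallel over splits of the collection and the framework would group the emitted pairs by centroid index before the reduce step. Every iteration is a separate pass over the whole collection, which is why the total running time of plain K-Means depends so heavily on how many iterations it needs.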

Cited By

Challa J, Goyal N, Sharma A, Sreekumar N, Balasubramaniam S, Goyal P (2024). A Survey and Experimental Review on Data Distribution Strategies for Parallel Spatial Clustering Algorithms. Journal of Computer Science and Technology 39:3 (610-636). DOI: 10.1007/s11390-024-2700-0. Online publication date: 1-May-2024
https://dl.acm.org/doi/10.1007/s11390-024-2700-0
Maithri C, Chandramouli H (2022). Parallel Agglomerative Hierarchical Clustering Algorithm Implementation with Hadoop MapReduce. Proceedings of 3rd International Conference on Machine Learning, Advances in Computing, Renewable Energy and Communication (115-126). DOI: 10.1007/978-981-19-2828-4_11. Online publication date: 18-Sep-2022
https://doi.org/10.1007/978-981-19-2828-4_11
Choubey V, Dubey S (2020). An Analytical Approach to Document Clustering Techniques. ICT Systems and Sustainability (35-42). DOI: 10.1007/978-981-15-0936-0_3. Online publication date: 29-Feb-2020
https://doi.org/10.1007/978-981-15-0936-0_3

Index Terms

The BigKClustering approach for document clustering using Hadoop MapReduce
1. Information systems
  1. Information systems applications
    1. Data mining
      1. Clustering
2. Networks
  1. Network services
    1. Cloud computing

Published In

PCI '18: Proceedings of the 22nd Pan-Hellenic Conference on Informatics

November 2018

336 pages

ISBN:9781450366106

DOI:10.1145/3291533

Editors: Karanikolas Nikitas (University of West Attica), Mamalis Basilis (University of West Attica)

General Chairs: Kontos John (University of Athens), Pantziou Grammati (University of West Attica), Gritzalis Stefanos (University of the Aegean), Douligeris Christos (University of Piraeus)

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

PCI '18

PCI '18: 22nd Pan-Hellenic Conference on Informatics

November 29 - December 1, 2018

Athens, Greece

Acceptance Rates

PCI '18 Paper Acceptance Rate 57 of 105 submissions, 54%;

Overall Acceptance Rate 190 of 390 submissions, 49%

Bibliometrics

Total Citations: 3
Total Downloads: 72
Downloads (Last 12 months): 3
Downloads (Last 6 weeks): 0

Reflects downloads up to 16 Feb 2025
