Abstract
Clustering textual data has become an important task in data analytics, since many applications require large collections of text documents to be organized automatically into homogeneous topics. The rapid growth of textual data available from the web, social networks, and open platforms has made this task more challenging. It is therefore important to design scalable clustering methods that can effectively organize huge amounts of textual data into topics. In this context, we propose a new parallel text clustering method based on the Spark framework and hashing. The proposed method addresses both the large number of documents and the high dimensionality of textual data, the former by integrating a divide-and-conquer approach and the latter by implementing a new document hashing strategy. Together, these two components yield a substantial improvement in scalability and a good approximation of clustering quality. Experiments performed on several large document collections show the effectiveness of the proposed method compared to existing ones in terms of running time and clustering accuracy.
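
As a rough illustration of how hashing can tame the high dimensionality of text, the sketch below maps a document to a fixed-length vector with the generic signed hashing trick. It is not the document hashing strategy proposed in the paper, which the abstract does not detail; the function name `hash_document`, the parameter `n_buckets`, the whitespace tokenizer, and the use of MD5 as the hash function are all assumptions made for the example.

```python
import hashlib
from collections import Counter


def hash_document(text, n_buckets=16):
    """Map a document to a fixed-length vector with the signed hashing trick.

    Hypothetical helper for illustration only: each term is hashed to a
    bucket index and a sign, and its count is added to that bucket.
    """
    vector = [0.0] * n_buckets
    # Naive tokenization (lowercase + whitespace split) is an assumption.
    term_counts = Counter(text.lower().split())
    for term, count in term_counts.items():
        digest = hashlib.md5(term.encode("utf-8")).digest()
        # First four bytes choose the bucket, the fifth byte chooses the sign.
        index = int.from_bytes(digest[:4], "big") % n_buckets
        sign = 1.0 if digest[4] % 2 == 0 else -1.0
        vector[index] += sign * count
    return vector


if __name__ == "__main__":
    docs = [
        "spark makes parallel clustering scalable",
        "hashing reduces the dimensionality of text",
    ]
    for doc in docs:
        print(hash_document(doc, n_buckets=8))
```

The vector length depends only on `n_buckets`, not on the vocabulary size, so each partition of a large collection can be hashed independently and in parallel without sharing a dictionary. Terms that collide in a bucket may partially cancel because of the sign, which is the usual trade-off of this technique.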

Cite this article
Ben HajKacem, M.A., Ben N’cir, CE. & Essoussi, N. A parallel text clustering method using Spark and hashing. Computing 103, 2007–2031 (2021). https://doi.org/10.1007/s00607-021-00932-y
DOI: https://doi.org/10.1007/s00607-021-00932-y