A differential evolution based algorithm to cluster text corpora using lazy re-evaluation of fringe points

Multimedia Tools and Applications

Abstract

Document clustering is a well-established technique for segregating voluminous text corpora into distinct categories. In this paper we present an improved algorithm for clustering large text corpora. The proposed algorithm addresses the challenges of clustering large corpora while maintaining high “goodness” values for the resulting clusters. It first forms the initial clusters by optimizing a fitness function using Differential Evolution. These clusters are then “refined” by re-evaluating the points that fall at the fringes of the clusters and reassigning them to other clusters if necessary. Two approaches, Nearest Cluster Based Re-evaluation (N-CBR) and Multiple Cluster Based Re-evaluation (M-CBR), are proposed for selecting candidates during the reassignment phase, and their performance is evaluated. The effect of this post-processing phase is demonstrated on a number of standard benchmark text corpora, on which the algorithm is found to be quite accurate and efficient. The results are also compared with those of other evolutionary strategies, namely Genetic Algorithm (GA), Particle Swarm Optimization (PSO) and Harmony Search (HS), and are found to be quite satisfactory.
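The two-phase idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract specifies neither the fitness function nor how fringe points are identified, so the sketch assumes (i) a fitness equal to the total squared Euclidean distance of points to their nearest centroid, (ii) a classic DE/rand/1/bin scheme, and (iii) that a point is on the "fringe" when its distance to its own centroid exceeds a per-cluster quantile. The reassignment step is only loosely inspired by the nearest-cluster idea behind N-CBR. All function names and parameters (`assign`, `de_cluster`, `refine_fringe`, `quantile`) are hypothetical, and toy 2-D points stand in for document vectors.

```python
import numpy as np


def assign(X, centroids):
    """Return (labels, distances) for the nearest centroid of each point."""
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d.argmin(axis=1)
    return labels, d[np.arange(len(X)), labels]


def de_cluster(X, k, pop_size=20, gens=100, F=0.5, CR=0.9, seed=0):
    """DE/rand/1/bin over flattened centroid sets (assumed fitness: total
    squared distance of points to their nearest centroid, lower is better)."""
    rng = np.random.default_rng(seed)
    n, dim = X.shape
    # Each candidate solution starts as k randomly chosen data points.
    pop = np.stack([X[rng.choice(n, k, replace=False)].ravel()
                    for _ in range(pop_size)])

    def fitness(v):
        return assign(X, v.reshape(k, dim))[1].sum()

    fit = np.array([fitness(v) for v in pop])
    for _ in range(gens):
        for i in range(pop_size):
            others = [j for j in range(pop_size) if j != i]
            a, b, c = pop[rng.choice(others, 3, replace=False)]
            mutant = a + F * (b - c)                 # differential mutation
            cross = rng.random(k * dim) < CR         # binomial crossover mask
            cross[rng.integers(k * dim)] = True      # keep at least one mutant gene
            trial = np.where(cross, mutant, pop[i])
            f = fitness(trial)
            if f <= fit[i]:                          # greedy selection
                pop[i], fit[i] = trial, f
    return pop[fit.argmin()].reshape(k, dim)


def refine_fringe(X, centroids, quantile=0.8):
    """Assumed refinement: flag the farthest points of each cluster as fringe,
    recompute centroids from the remaining core points, then reassign each
    fringe point to its nearest recomputed centroid."""
    labels, dist = assign(X, centroids)
    k = len(centroids)
    fringe = np.zeros(len(X), dtype=bool)
    for c in range(k):
        members = labels == c
        if members.any():
            fringe |= members & (dist > np.quantile(dist[members], quantile))
    core = centroids.copy()
    for c in range(k):
        keep = (labels == c) & ~fringe
        if keep.any():
            core[c] = X[keep].mean(axis=0)
    new_labels = labels.copy()
    new_labels[fringe] = assign(X[fringe], core)[0]
    return new_labels, fringe


# Toy usage: two synthetic blobs in place of real document vectors.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(3, 0.3, (30, 2))])
centroids = de_cluster(X, k=2)
labels, fringe = refine_fringe(X, centroids)
```

Only the fringe points are re-evaluated in the second phase, which is what keeps the refinement "lazy": the bulk of each cluster is left untouched, so the post-processing cost grows with the number of fringe points rather than with the size of the corpus.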

Notes

  1. http://sites.labic.icmc.usp.br/text_collections

  2. http://mlg.ucd.ie/datasets/bbc.html

  3. https://conda.io/miniconda.html

Author information

Corresponding author

Correspondence to D. Mustafi.

Ethics declarations

Conflict of Interests

The authors hereby declare that they have no conflict of interest. This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Mustafi, D., Mustafi, A. A differential evolution based algorithm to cluster text corpora using lazy re-evaluation of fringe points. Multimed Tools Appl 82, 32177–32201 (2023). https://doi.org/10.1007/s11042-023-14716-3

  • DOI: https://doi.org/10.1007/s11042-023-14716-3
