Automatic Scientific Document Clustering Using Self-organized Multi-objective Differential Evolution

Cognitive Computation

Abstract

Document clustering is the partitioning of a given collection of documents into K groups based on some similarity/dissimilarity criterion. This task has applications in scope detection of journals and conferences, development of automated peer-review support systems, topic modeling, recent cognitive-inspired work on text summarization, and semantics-based classification of documents. In this paper, we propose a cognitive-inspired multi-objective automatic document clustering technique that fuses a self-organizing map (SOM) with a multi-objective differential evolution approach. A variable number of cluster centers is encoded in the different solutions of the population, so that the number of clusters in a data set is determined automatically. These solutions undergo various genetic operations during evolution, and the concept of SOM is used to design new genetic operators for the proposed clustering technique. To measure the goodness of a clustering solution, two cluster validity indices, the Pakhira-Bandyopadhyay-Maulik index and the Silhouette index, are optimized simultaneously. The effectiveness of the proposed approach, named the self-organizing map based multi-objective document clustering technique (SMODoc_clust), is shown on the automatic classification of scientific articles and web documents. Different representation schemes, including tf, tf-idf, and word embeddings, are employed to convert articles into vector form. Comparative results with respect to two internal cluster validity indices, the Dunn index and the Davies-Bouldin index, are reported against several state-of-the-art clustering techniques: three multi-objective clustering techniques (MOCK, VAMOSA, and NSGA-II-Clust), a single-objective genetic algorithm (SOGA) based clustering technique, K-means, and single-linkage clustering. The results show that the proposed approach outperforms these existing approaches on the reported indices. The results are further validated using statistical significance t-tests.
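
As a rough illustration of the two objectives named in the abstract, the Python sketch below computes the Pakhira-Bandyopadhyay-Maulik (PBM) index and the Silhouette index for a labeled partition, and checks Pareto dominance between two partitions when both indices are maximized. The function names (`pbm_index`, `silhouette_index`, `dominates`), the use of Euclidean distance, and the use of centroids as cluster centers are assumptions made for this sketch; it is not the SMODoc_clust implementation.

```python
import math
from itertools import combinations


def dist(p, q):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def centroid(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def group(points, labels):
    """Map each cluster label to the list of points carrying it."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    return clusters


def pbm_index(points, labels):
    """PBM = ((1/K) * (E_1 / E_K) * D_K) ** 2; larger is better."""
    clusters = group(points, labels)
    k = len(clusters)
    if k < 2:
        raise ValueError("PBM needs at least two clusters")
    members = list(clusters.values())
    centers = [centroid(m) for m in members]
    overall = centroid(points)
    # E_1: total distance of all points to the centroid of the whole data set.
    e1 = sum(dist(p, overall) for p in points)
    # E_K: total distance of points to their own cluster center.
    ek = sum(dist(p, c) for c, m in zip(centers, members) for p in m)
    # D_K: largest distance between any two cluster centers.
    dk = max(dist(a, b) for a, b in combinations(centers, 2))
    if ek == 0:
        return math.inf
    return ((e1 * dk) / (k * ek)) ** 2


def silhouette_index(points, labels):
    """Mean silhouette width over all points; lies in [-1, 1], larger is better."""
    clusters = group(points, labels)
    if len(clusters) < 2:
        raise ValueError("Silhouette needs at least two clusters")
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton cluster contributes a width of 0
        # a: mean distance to the other members of the same cluster.
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(dist(p, q) for q in m) / len(m)
            for other, m in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


def dominates(scores_a, scores_b):
    """True if A is no worse than B on every objective and better on one."""
    return all(x >= y for x, y in zip(scores_a, scores_b)) and any(
        x > y for x, y in zip(scores_a, scores_b)
    )


points = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
compact = [0, 0, 1, 1]  # groups nearby points together
mixed = [0, 1, 0, 1]    # splits each natural group
for name, labels in (("compact", compact), ("mixed", mixed)):
    print(name, pbm_index(points, labels), silhouette_index(points, labels))
```

In a multi-objective setting, each candidate partition would be scored by the pair of index values, and only partitions not dominated by any other would be retained as the Pareto front.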

Notes

  1. https://www.kaggle.com/benhamner/exploring-the-nips-2015-papers/data

  2. We used the Python NLTK toolkit [42] to remove stop words; its stop-word list contains 153 words.

  3. Here, the SnowballStemmer of NLTK [42] is used.

  4. https://github.com/jhlau/doc2vec
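
The preprocessing mentioned in these notes (stop-word removal before vectorization) and the tf and tf-idf representations named in the abstract can be sketched as follows. This is a dependency-free illustration, not the authors' pipeline: the tiny `STOP_WORDS` set is a hypothetical stand-in for the 153-word NLTK list, stemming is omitted, and the weighting shown (relative term frequency times ln(N/df)) is only one common tf-idf variant, which may differ from the one used in the paper.

```python
import math
import re
from collections import Counter

# Hypothetical stand-in for the NLTK English stop-word list.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "for"}


def tokenize(text):
    """Lowercase, keep alphabetic tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def tf_idf(docs):
    """Return (vocabulary, tf vectors, tf-idf vectors) for a list of strings."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(t for doc in tokenized for t in set(doc))
    idf = {t: math.log(n_docs / df[t]) for t in vocab}
    tf_vectors, tfidf_vectors = [], []
    for doc in tokenized:
        counts = Counter(doc)
        length = len(doc) or 1  # guard against empty documents
        tf = [counts[t] / length for t in vocab]
        tf_vectors.append(tf)
        tfidf_vectors.append([w * idf[t] for w, t in zip(tf, vocab)])
    return vocab, tf_vectors, tfidf_vectors


docs = [
    "Clustering of scientific documents",
    "Differential evolution for clustering",
    "Self-organizing maps and clustering",
]
vocab, tf_vecs, tfidf_vecs = tf_idf(docs)
print(vocab)
print(tfidf_vecs[0])
```

A term that occurs in every document, such as "clustering" in the toy corpus above, receives an idf of ln(1) = 0 under this variant, so it does not help separate the documents.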

References

  1. Aggarwal CC, Zhai C. Mining text data. Berlin: Springer Science & Business Media; 2012.

  2. Al-Radaideh QA, Bataineh DQ. 2018. A hybrid approach for arabic text summarization using domain knowledge and genetic algorithms. Cognitive Computation, 1–19.

  3. Arbelaitz O, Gurrutxaga I, Muguerza J, Pérez JM, Perona I. An extensive comparative study of cluster validity indices. Pattern Recogn 2013;46(1):243–256.

  4. Bandyopadhyay S, Maulik U. Nonparametric genetic clustering: comparison of validity indices. IEEE Trans Syst, Man, Cybern Part C (Applications and Reviews) 2001;31(1):120–125.

  5. Bandyopadhyay S, Maulik U. Genetic clustering for automatic evolution of clusters and application to image classification. Pattern Recogn 2002;35(6):1197–1208.

  6. Bandyopadhyay S, Saha S. GAPS: a clustering method using a new point symmetry-based distance measure. Pattern Recogn 2007;40(12):3430–3451.

  7. Bandyopadhyay S, Saha S. A new principal axis based line symmetry measurement and its application to clustering. International Conference on Neural Information Processing. Springer; 2008. p. 543–550.

  8. Bandyopadhyay S, Saha S. A point symmetry-based clustering technique for automatic evolution of clusters. IEEE Trans Knowl Data Eng 2008b;20(11):1441–1457.

  9. Bandyopadhyay S, Maulik U, Mukhopadhyay A. Multiobjective genetic clustering for pixel classification in remote sensing imagery. IEEE Trans Geoscience Remote Sens 2007;45(5):1506–1511.

  10. Bandyopadhyay S, Saha S, Maulik U, Deb K. A simulated annealing-based multiobjective optimization algorithm: AMOSA. IEEE Trans Evol Comput 2008;12(3):269–283.

  11. Blei DM, Ng AY, Jordan MI. Latent dirichlet allocation. J Mach Learn Res 2003;3:993–1022.

  12. Buitelaar P, Eigner T. Topic extraction from scientific literature for competency management. The 7th International Semantic Web Conference; 2008. p. 25–66.

  13. Cardoso-Cachopo A. 2007. Improving methods for single-label text categorization. PhD thesis, Instituto Superior Tecnico, Universidade Tecnica de Lisboa.

  14. Carpenter MP, Narin F. Clustering of scientific journals. J Assoc Inform Sci Technol 1973;24(6):425–436.

  15. Yw C, Zhou Q, Luo W, Du JX. Classification of chinese texts based on recognition of semantic topics. Cogn Comput 2016;8(1):114–124. https://doi.org/10.1007/s12559-015-9346-8.

  16. Das S, Abraham A, Konar A. Automatic clustering using an improved differential evolution algorithm. IEEE Trans Syst, Man, Cybern-Part A: Syst Human 2008;38(1):218–237.

  17. Davies DL, Bouldin DW. A cluster separation measure. IEEE Trans Pattern Anal Mach Intell 1979;PAMI-1(2):224–227. https://doi.org/10.1109/TPAMI.1979.4766909.

  18. Deb K. Multi-objective optimization using evolutionary algorithms, Vol. 16. New York: Wiley; 2001.

  19. Deb K, Tiwari S. Omni-optimizer: a generic evolutionary algorithm for single and multi-objective optimization. Eur J Oper Res 2008;185(3):1062–1087.

  20. Deb K, Pratap A, Agarwal S, Meyarivan T. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans Evol Comput 2002;6(2):182–197.

  21. Demšar J. Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 2006;7(Jan):1–30.

  22. Doerre J, Gerstl P, Goeser S, Mueller A, Seiffert R. 2002. Taxonomy generation for document collections. US Patent 6,446,061.

  23. Dutta P, Saha S. Fusion of expression values and protein interaction information using multi-objective optimization for improving gene clustering. Comput Biol Med 2017;89:31–43.

  24. Fortuna B, Grobelnik M, Mladenic D. Visualization of text document corpus. Informatica 2005;29:4.

  25. Goldstein J, Mittal V, Carbonell J, Kantrowitz M. Multi-document summarization by sentence extraction. Proceedings of the 2000 NAACL-ANLP Workshop on Automatic Summarization - Volume 4, Association for Computational Linguistics, Stroudsburg, PA, USA, NAACL-ANLP-AutoSum ’00; 2000. p. 40–48. https://doi.org/10.3115/1117575.1117580.

  26. Gu F, Liu HL, Tan KC. A multiobjective evolutionary algorithm using dynamic weight design method. Int J Innovative Comput Inf Control 2012;8:3677–3688.

  27. Gupta V, Kaur N. A novel hybrid text summarization system for punjabi text. Cogn Comput 2016;8(2):261–277.

  28. Handl J, Knowles J. An evolutionary approach to multiobjective clustering. IEEE Trans Evol Comput 2007;11(1):56–76.

  29. Haykin SS. Neural networks and learning machines, Vol. 3. Upper Saddle River: Pearson; 2009.

  30. Iorio A, Li X. Rotated problems and rotationally invariant crossover in evolutionary multi-objective optimization. Int J Comput Intell Appl 2008;7(02):149–186.

  31. Jain AK, Dubes RC. Algorithms for clustering data. Upper Saddle River: Prentice-Hall, Inc; 1988.

  32. Kashef R, Kamel MS. Enhanced bisecting k-means clustering using intermediate cooperation. Pattern Recogn 2009;42(11):2557–2569.

  33. Kennedy J. Particle swarm optimization. Encyclopedia of machine learning. Springer; 2011. p. 760–766.

  34. Kohonen T. The self-organizing map. Neurocomputing 1998;21(1):1–6.

  35. Konak A, Coit DW, Smith AE. Multi-objective optimization using genetic algorithms: a tutorial. Reliability Eng Syst Safety 2006;91(9):992–1007.

  36. Korenius T, Laurikkala J, Järvelin K, Juhola M. Stemming and lemmatization in the clustering of finnish text documents. Proceedings of the thirteenth ACM international conference on Information and knowledge management. ACM; 2004. p. 625–633.

  37. Kovács F, Legány C, Babos A. Cluster validity measurement techniques. 6th International symposium of hungarian researchers on computational intelligence; 2005.

  38. Lauren P, Qu G, Yang J, Watta P, Huang GB, Lendasse A. 2018. Generating word embeddings from an extreme learning machine for sentiment analysis and sequence labeling tasks. Cognitive Computation, 1–14.

  39. Le Q, Mikolov T. Distributed representations of sentences and documents. Proceedings of the 31st International Conference on Machine Learning (ICML-14); 2014. p. 1188–1196.

  40. Li Y, Pan Q, Yang T, Wang S, Tang J, Cambria E. Learning word representations for sentiment analysis. Cogn Comput 2017;9(6):843–851.

  41. Lichman M. 2013. UCI machine learning repository. http://archive.ics.uci.edu/ml.

  42. Loper E, Bird S. NLTK: the natural language toolkit. Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics - Volume 1, Association for Computational Linguistics, Stroudsburg, PA, USA, ETMTNLP ’02; 2002. p. 63–70. https://doi.org/10.3115/1118108.1118117.

  43. Manning CD, Raghavan P, Schütze H. Introduction to information retrieval. Cambridge: Cambridge University Press; 2009.

  44. Maulik U, Bandyopadhyay S. Performance evaluation of some clustering algorithms and validity indices. IEEE Trans Pattern Anal Mach Intell 2002;24(12):1650–1654.

  45. Mikolov T, Chen K, Corrado G, Dean J. 2013. Efficient estimation of word representations in vector space. arXiv:1301.3781.

  46. Moran K, Wallace BC, Brodley CE. Discovering better aaai keywords via clustering with community-sourced constraints. AAAI; 2014. p. 1265–1271.

  47. Pakhira MK, Bandyopadhyay S, Maulik U. Validity index for crisp and fuzzy clusters. Pattern Recogn 2004;37(3):487–501.

  48. Pennington J, Socher R, Manning C. Glove: global vectors for word representation. Proceedings of the 2014 conference on Empirical Methods in Natural Language Processing (EMNLP); 2014. p. 1532–1543.

  49. Price K, Storn RM, Lampinen JA. Differential evolution: a practical approach to global optimization. Berlin: Springer Science & Business Media; 2006.

  50. Roussinov DG, Chen H. 1998. A scalable self-organizing map algorithm for textual classification: a neural network approach to thesaurus generation.

  51. Saha S, Bandyopadhyay S. A symmetry based multiobjective clustering technique for automatic evolution of clusters. Pattern Recogn 2010;43(3):738–751.

  52. Saha S, Bandyopadhyay S. Some connectivity based cluster validity indices. Appl Soft Comput 2012;12(5):1555–1565.

  53. Saha S, Bandyopadhyay S. A generalized automatic clustering algorithm in a multiobjective framework. Appl Soft Comput 2013;13(1):89–108.

  54. Sahi M, Gupta V. A novel technique for detecting plagiarism in documents exploiting information sources. Cogn Comput 2017;9(6):852–867.

  55. Saini N, Chourasia S, Saha S, Bhattacharyya P. A self organizing map based multi-objective framework for automatic evolution of clusters. International Conference on Neural Information Processing. Springer; 2017. p. 672–682.

  56. Saini N, Saha S, Bhattacharyya P. Cascaded SOM: an improved technique for automatic email classification. 2018 International Joint Conference on Neural Networks (IJCNN). IEEE; 2018. p. 1–8.

  57. Singh J, Gupta V. An efficient corpus-based stemmer. Cogn Comput 2017;9(5):671–688.

  58. Starczewski A. A new validity index for crisp clusters. Pattern Anal Applic 2017;20(3):687–700.

  59. Steinbach M, Karypis G, Kumar V, et al. A comparison of document clustering techniques. KDD Workshop on text mining, Boston; 2000. p. 525–526.

  60. Suresh K, Kundu D, Ghosh S, Das S, Abraham A. Data clustering using multi-objective differential evolution algorithms. Fundamenta Informaticae 2009;97(4):381–403.

  61. Wang H. 2014. Introduction to word2vec and its application to find predominant word senses. http://complinghssntuedusg/courses/hg7017/pdf/word2vec and its application to wsd pdf.

  62. Welch BL. The generalization of ‘student’s’ problem when several different population variances are involved. Biometrika 1947;34(1/2):28–35. http://www.jstor.org/stable/2332510.

  63. Witten I, Bainbridge D, Paynter G, Boddie S. 2002. Importing documents and metadata into digital libraries: requirements analysis and an extensible architecture. Research and Advanced Technology for Digital Libraries, 219–229.

  64. Xu W, Liu X, Gong Y. Document clustering based on non-negative matrix factorization. Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval. ACM; 2003. p. 267–273.

  65. Zhang H, Zhang X, Gao XZ, Song S. Self-organizing multiobjective optimization based on decomposition with neighborhood ensemble. Neurocomputing 2016;173:1868–1884.

  66. Zhang H, Zhou A, Song S, Zhang Q, Gao XZ, Zhang J. A self-organizing multiobjective evolutionary algorithm. IEEE Trans Evol Comput 2016;20(5):792–806. https://doi.org/10.1109/TEVC.2016.2521868.

  67. Zhou A, Qf Z, Zhang G. Multiobjective evolutionary algorithm based on mixture gaussian models. J Softw 2014;25(5):913–928.

Acknowledgments

Dr. Sriparna Saha would like to acknowledge the support from SERB Women in Excellence Award-SB/WEA/08/2017 for conducting this particular research.

Author information

Corresponding author

Correspondence to Naveen Saini.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

About this article

Cite this article

Saini, N., Saha, S. & Bhattacharyya, P. Automatic Scientific Document Clustering Using Self-organized Multi-objective Differential Evolution. Cogn Comput 11, 271–293 (2019). https://doi.org/10.1007/s12559-018-9611-8

  • DOI: https://doi.org/10.1007/s12559-018-9611-8
