Abstract
Document clustering is the partitioning of a given collection of documents into K groups based on some similarity/dissimilarity criterion. This task has applications in scope detection of journals/conferences, development of automated peer-review support systems, topic modeling, recent cognitive-inspired work on text summarization, and semantics-based classification of documents. In the current paper, a cognitive-inspired multi-objective automatic document clustering technique is proposed, which is a fusion of the self-organizing map (SOM) and a multi-objective differential evolution approach. Different solutions of the population encode a varying number of cluster centers, so that the number of clusters is determined from a data set in an automated way. These solutions undergo various genetic operations during evolution. The concept of SOM is utilized in designing new genetic operators for the proposed clustering technique. In order to measure the goodness of a clustering solution, two cluster validity indices, the Pakhira-Bandyopadhyay-Maulik index and the Silhouette index, are optimized simultaneously. The effectiveness of the proposed approach, namely the self-organizing map based multi-objective document clustering technique (SMODoc_clust), is shown in the automatic classification of some scientific articles and web documents. Different representation schemas, including tf, tf-idf, and word embedding, are employed to convert articles into vector form. Comparative results with respect to internal cluster validity indices, namely the Dunn index and the Davies-Bouldin index, are shown against several state-of-the-art clustering techniques: three multi-objective clustering techniques (MOCK, VAMOSA, and NSGA-II-Clust), a single-objective genetic algorithm (SOGA) based clustering technique, K-means, and single-linkage clustering. The results obtained show that the proposed approach performs better than these existing approaches. The obtained results are also validated using statistical significance t-tests.
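As a rough illustration of the two objectives named above, the following sketch scores a candidate partition of tf-idf document vectors with the Pakhira-Bandyopadhyay-Maulik (PBM) index and the Silhouette index, both of which are to be maximized. It is a minimal sketch and not the authors' implementation: the function names, the toy documents, the particular tf-idf variant, and the use of Euclidean distance are all assumptions made for illustration, and the SOM-based evolutionary search itself is not shown.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Dense tf-idf vectors for tokenized documents (one simple variant, assumed here)."""
    vocab = sorted({w for d in docs for w in d})
    df = Counter(w for d in docs for w in set(d))
    idf = {w: math.log(len(docs) / df[w]) + 1.0 for w in vocab}
    return [[Counter(d)[w] / len(d) * idf[w] for w in vocab] for d in docs]

def dist(a, b):
    """Euclidean distance (an assumption; other distances are possible)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(points):
    return [sum(col) / len(points) for col in zip(*points)]

def pbm_index(X, labels):
    """PBM = ((1/K) * (E_1 / E_K) * D_K)^2; needs K >= 2 and E_K > 0."""
    clusters = {}
    for x, lab in zip(X, labels):
        clusters.setdefault(lab, []).append(x)
    groups = list(clusters.values())
    centers = [centroid(g) for g in groups]
    k = len(groups)
    global_center = centroid(X)
    e1 = sum(dist(x, global_center) for x in X)       # scatter around the global centroid
    ek = sum(dist(x, c) for c, g in zip(centers, groups) for x in g)  # within-cluster scatter
    dk = max(dist(a, b) for i, a in enumerate(centers) for b in centers[i + 1:])
    return ((e1 / ek) * dk / k) ** 2

def silhouette(X, labels):
    """Mean of s(i) = (b - a) / max(a, b) over all points."""
    scores = []
    for i, (x, li) in enumerate(zip(X, labels)):
        per_cluster = {}
        for j, (y, lj) in enumerate(zip(X, labels)):
            if i != j:
                per_cluster.setdefault(lj, []).append(dist(x, y))
        own = per_cluster.pop(li, None)
        if not own or not per_cluster:
            scores.append(0.0)  # singleton cluster (or only one cluster): score 0 by convention
            continue
        a = sum(own) / len(own)                                  # mean distance within own cluster
        b = min(sum(d) / len(d) for d in per_cluster.values())   # nearest other cluster
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Hypothetical toy corpus: two documents on clustering, two on gene analysis.
docs = [
    ["cluster", "document", "text"],
    ["document", "cluster", "topic"],
    ["gene", "protein", "expression"],
    ["protein", "gene", "network"],
]
X = tfidf_vectors(docs)
good = [0, 0, 1, 1]  # groups documents by topic
bad = [0, 1, 0, 1]   # mixes the two topics

for name, labels in (("good", good), ("bad", bad)):
    print(name, pbm_index(X, labels), silhouette(X, labels))
```

In a multi-objective setting, each candidate solution would carry both scores, and solutions would be compared by Pareto dominance rather than by a single combined value.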
Acknowledgments
Dr. Sriparna Saha would like to acknowledge the support from SERB Women in Excellence Award-SB/WEA/08/2017 for conducting this particular research.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Ethical approval
This article does not contain any studies with human participants or animals performed by any of the authors.
About this article
Cite this article
Saini, N., Saha, S. & Bhattacharyya, P. Automatic Scientific Document Clustering Using Self-organized Multi-objective Differential Evolution. Cogn Comput 11, 271–293 (2019). https://doi.org/10.1007/s12559-018-9611-8
DOI: https://doi.org/10.1007/s12559-018-9611-8