Abstract
The technical advances in the information systems contribute towards the massive availability of the documents stored in the electronic database, such as e-mails, internet and web pages. Thus, it becomes a complex task for arranging and browsing the required document. This paper proposes an incremental document clustering method for performing effective document clustering. The proposed model undergoes three steps for document clustering, namely pre-processing, feature extraction and Incremental document categorization. The pre-processing step is carried out for removing the artifacts and redundant data from the documents by undergoing stop word removal process and stemming process. Then, the next step is the feature extraction based on Term Frequency-Inverse Document Frequency (TF–IDF) and Wordnet features. Here, the feature is selected using support measure named ModSupport, and then, the incremental document clustering is performed based on the hybrid fuzzy bounding degree and Rider-Moth Flame optimization algorithm (RMFO) using the boundary degree. Here, the RMFO aims at the selection of the optimal weights for the boundary degree model and is designed by integrating Rider Optimization Algorithm (ROA) with Moth Flame optimization (MFO). The performance of the proposed RMFO outperformed the existing techniques using accuracy, F-measure, precision, and recall with maximal values 93.98%, 94.876%, 93.958% and 93.964% respectively.
Similar content being viewed by others
References
Chevalier M, El Malki M, Kopliku A, Teste O, Tournier R (2016) Implementation of multidimensional databases with document-oriented NoSQL. In: Big data analytics and knowledge discovery, pp 379–390
Martinho B, Santos MY (2016) An architecture for data warehousing in big data environments. In: Research and practical issues of enterprise information systems, vol 268, pp 237–250
Doermann D (1998) The indexing and retrieval of document images: a survey. Comput Vis Image Underst 70(3):287–298
Callan JP (1994) Passage-level evidence in document retrieval. In: SIGIR. Springer, Berlin, pp 302–310
Hao S, Shi C, Niu Z, Cao L (2018) Concept coupling learning for improving concept lattice-based document retrieval. Eng Appl Artif Intell 69:65–75
Mothe J, Chrisment C, Dousset B, Alaux J (2003) DocCube: multi-dimensional visualisation and exploration of large document sets. J Am Soc Inf Sci Technol 54(7):650–659
Slonim N, Tishby N (2000) Document clustering using word clusters via the information bottleneck method. In: Proceedings of the 23rd annual international conference on research and development in information retrieval, pp 208–215
Karypis MSG, Kumar V, Steinbach M (2000) A comparison of document clustering techniques. In: Proceedings of TextMining workshop at KDD2000, May 2000
Li N, Luo W, Yang K, Zhuang F, He Q, Shi Z (2018) Self-organizing weighted incremental probabilistic latent semantic analysis. Int J Mach Learn Cybern 9(12):1987–1998
Wan Y, Liu X, Wu Y, Guo L, Chen Q, Wang M (2018) ICGT: a novel incremental clustering approach based on GMM tree. Data Knowl Eng 117:71–86
Sangaiah AK, Fakhry AE, Abdel-Basset M, El-Henawy I (2018) Arabic text clustering using improved clustering algorithms with dimensionality reduction. Cluster Comput 22:1–15
Kotte VK, Rajavelu S, Rajsingh EB (2019) A similarity function for feature pattern clustering and high dimensional text document classification. Found Sci. https://doi.org/10.1007/s10699-019-09592-w
Mulay P, Shinde K (2019) Personalized diabetes analysis using correlation-based incremental clustering algorithm. In: Mittal M, Balas VE, Goyal LM, Kumar R (eds) Big data processing using spark in cloud. Springer, Berlin, pp 167–193
Madhusudhanan S, Jaganathan S (2018) Incremental learning for classification of unstructured data using extreme learning machine. Algorithms 11(10):158
Kannan J, Shanavas AM, Swaminathan S (2018) SportsBuzzer: detecting events at real time in Twitter using incremental clustering. Trans Mach Learn Artif Intell 6(1):01
Liu Y, Chen J, Wu S, Liu Z, Chao H (2018) Incremental fuzzy C medoids clustering of time series data using dynamic time warping distance. PLoS ONE 13(5):0197499
Binu D, Kariyappa BS (2018) RideNN: a new rider optimization algorithm-based neural network for fault diagnosis in analog circuits. IEEE Trans Instrum Meas 68:2–26
Mirjalili S (2015) Moth–flame optimization algorithm: a novel nature-inspired heuristic paradigm. Knowl Based Syst 89:228–249
Sedding J, Kazakov D (2004) WordNet-based text document clustering. In: Proceedings of the 3rd workshop on robust methods in analysis of natural language data, pp 104–113
Yarlagadda M, Gangadhara Roa K, Srikrishna A (2019) Frequent itemset-based feature selection and Rider Moth Search Algorithm for document clustering. J King Saud Univ Comput Inf Sci. https://doi.org/10.1016/j.jksuci.2019.09.002
Xu Z, Xia M (2011) Distance and similarity measures for hesitant fuzzy sets. Inf Sci 181(11):2128–2138
Newsgroup database. http://qwone.com/~jason/20Newsgroups/. Accessed Oct 2018
Reuter Database. https://archive.ics.uci.edu/ml/datasets/reuters-21578+text+categorization+collection. Accessed Oct 2018
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
About this article
Cite this article
Yarlagadda, M., Kancherla, G.R. & Atluri, S. Incremental document clustering using fuzzy-based optimization strategy. Evol. Intel. 13, 497–510 (2020). https://doi.org/10.1007/s12065-019-00335-1
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s12065-019-00335-1