Abstract
Automatic topic extraction (TE) from scientific publications provides a compact summary of the contents of document clusters, which makes information easier to locate and helps delineate the boundaries of scientific fields. Text document clustering (TDC) is generally the first step of topic identification: it groups documents that address related subject matter. Metaheuristics are commonly used as efficient approaches to TDC. The multi-verse optimizer (MVO) is a recently proposed stochastic population-based algorithm that has been applied successfully to many hard optimization problems. In the TE process, each statistical TE method focuses on different aspects of the language feature space. This paper proposes a novel ensemble method for automatic TE from a collection of scientific publications, with MVO as the clustering algorithm. The automatic TE methods used in our approach are term frequency-inverse document frequency (TF-IDF), most-frequent-based keyword extraction (TF), co-occurrence statistical information-based keyword extraction (CSI), TextRank (TR), and mutual information (MI). Each TE method supplies a group of candidate topics to the ensemble. The ensemble then prunes the candidate set using a specific filtering heuristic, recalculates the scores of the remaining candidates according to the prescribed metrics, and applies dynamic threshold functions to select the final set of topics for the given scientific publications. The findings show that the refined candidate set is both efficient and effective, and that the new topics improve the quality of the system. The proposed method achieved better precision and recall than state-of-the-art TE methods on a similar dataset.
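The ensemble stage described above (candidate generation, pruning, rescoring, thresholding) can be illustrated with a minimal Python sketch. It is not the paper's implementation: only two of the five extractors (TF and TF-IDF) are shown, and the function names, the smoothed IDF, the vote-based filter (`min_votes`), the mean-score rescoring, and the mean-based threshold are all assumptions chosen for illustration, since the abstract does not specify the filtering heuristic, the metrics, or the threshold functions.

```python
import math
from collections import Counter


def tf_scores(tokens):
    """Term frequency, scaled to [0, 1] by the most frequent term."""
    counts = Counter(tokens)
    top = max(counts.values())
    return {t: c / top for t, c in counts.items()}


def tfidf_scores(tokens, clusters):
    """TF-IDF of one cluster against all clusters, scaled to [0, 1]."""
    counts = Counter(tokens)
    n = len(clusters)
    raw = {}
    for term, c in counts.items():
        df = sum(1 for other in clusters if term in other)
        # Smoothed IDF; the exact weighting is an assumption.
        raw[term] = c * math.log((1 + n) / (1 + df))
    top = max(raw.values()) or 1.0  # avoid dividing by zero
    return {t: s / top for t, s in raw.items()}


def top_k(scores, k):
    """Keep the k best-scoring terms as one extractor's candidate set."""
    best = sorted(scores, key=lambda t: -scores[t])[:k]
    return {t: scores[t] for t in best}


def ensemble_topics(candidate_sets, min_votes=2):
    """Prune, rescore, and threshold candidates (all three rules hypothetical)."""
    # Filtering heuristic (assumed): keep terms proposed by >= min_votes extractors.
    votes = Counter(t for cands in candidate_sets for t in cands)
    kept = [t for t, v in votes.items() if v >= min_votes]
    if not kept:
        return []
    # Rescoring (assumed): mean score across extractors, 0 where a term is absent.
    merged = {
        t: sum(c.get(t, 0.0) for c in candidate_sets) / len(candidate_sets)
        for t in kept
    }
    # Dynamic threshold (assumed): cut at the mean merged score of this cluster.
    threshold = sum(merged.values()) / len(merged)
    selected = [t for t in merged if merged[t] >= threshold]
    return sorted(selected, key=lambda t: -merged[t])


# Toy clusters standing in for the output of the clustering step.
clusters = [
    "cluster topic cluster optimizer topic cluster text".split(),
    "image text segmentation image filter".split(),
    "power text scheduling battery power".split(),
]
candidates = [
    top_k(tf_scores(clusters[0]), 3),
    top_k(tfidf_scores(clusters[0], clusters), 3),
]
topics = ensemble_topics(candidates)
print(topics)
```

Because the threshold is computed from each cluster's own merged scores, the number of selected topics varies per cluster rather than being fixed in advance, which is the general idea behind a dynamic threshold.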
Acknowledgements
This work was supported by Universiti Sains Malaysia (USM) under Grant (1001/PK OMP/8014016).
Cite this article
Abasi, A.K., Khader, A.T., Al-Betar, M.A. et al. A novel ensemble statistical topic extraction method for scientific publications based on optimization clustering. Multimed Tools Appl 80, 37–82 (2021). https://doi.org/10.1007/s11042-020-09504-2
DOI: https://doi.org/10.1007/s11042-020-09504-2