A novel ensemble statistical topic extraction method for scientific publications based on optimization clustering

Multimedia Tools and Applications

Abstract

Automatic topic extraction (TE) from scientific publications provides a compact summary of the contents of document clusters, which helps readers locate information and helps define the boundaries of scientific fields. Text document clustering (TDC) is generally the first step of topic identification: it groups documents that address related subject matter. Metaheuristics are commonly used as efficient approaches to TDC. The multi-verse optimizer (MVO) is a recently proposed stochastic population-based algorithm that has been applied successfully to many hard optimization problems. Each statistical TE method focuses on different aspects of the language feature space. This paper designs a novel ensemble method for automatic TE from a collection of scientific publications, using MVO as the clustering algorithm. The ensemble combines five automatic TE methods: term frequency-inverse document frequency (TF-IDF), most-frequent-based keyword extraction (TF), co-occurrence statistical information-based keyword extraction (CSI), TextRank (TR), and mutual information (MI). Each method supplies a group of candidate topics to the ensemble. The ensemble then prunes the candidate set with a filtering heuristic, recalculates the scores of the remaining candidates according to prescribed metrics, and applies dynamic threshold functions to select the set of topics for a given scientific publication. The findings highlight the efficiency and effectiveness of the refined candidate set, and the results show that the new topics improve the quality of the system. The proposed method achieved better precision and recall than state-of-the-art TE methods on a similar dataset.
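The four ensemble stages named in the abstract (candidate generation, filtering, rescoring, dynamic thresholding) can be illustrated with a minimal Python sketch. The abstract does not specify the filtering heuristic, the rescoring metrics, or the threshold functions, so everything concrete here is an assumption made for illustration: the majority-vote filter, the mean of min-max normalised scores, the mean-based threshold, and the function names are all hypothetical. Only two of the five extractors (TF and TF-IDF) are shown, and the documents are assumed to be already tokenised and already grouped into a cluster; the MVO clustering step is not reproduced.

```python
import math
from collections import Counter


def tf_scores(cluster_docs):
    """Most-frequent-term scoring: raw term counts over the cluster."""
    return dict(Counter(tok for doc in cluster_docs for tok in doc))


def tfidf_scores(cluster_docs, all_docs):
    """Cluster term frequency times a smoothed inverse document frequency."""
    n_docs = len(all_docs)
    doc_freq = Counter(tok for doc in all_docs for tok in set(doc))
    tf = Counter(tok for doc in cluster_docs for tok in doc)
    return {t: c * math.log((1 + n_docs) / (1 + doc_freq[t])) for t, c in tf.items()}


def min_max(scores):
    """Rescale one method's scores to [0, 1] so methods are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    return {t: (s - lo) / span if span else 1.0 for t, s in scores.items()}


def ensemble_topics(method_scores, top_k=5, min_votes=2):
    """Combine per-method scores (non-empty dicts) into one topic list."""
    normalised = [min_max(s) for s in method_scores]

    # Stage 1: each method proposes its top_k candidates.
    candidates = [set(sorted(s, key=s.get, reverse=True)[:top_k]) for s in normalised]

    # Stage 2 (assumed heuristic): keep terms proposed by >= min_votes methods.
    votes = Counter(t for c in candidates for t in c)
    kept = [t for t, v in votes.items() if v >= min_votes]
    if not kept:
        return []

    # Stage 3 (assumed metric): mean normalised score across all methods.
    rescored = {
        t: sum(s.get(t, 0.0) for s in normalised) / len(normalised) for t in kept
    }

    # Stage 4 (assumed threshold): cut at the mean of the rescored values,
    # so the cut-off adapts to each cluster's score distribution.
    threshold = sum(rescored.values()) / len(rescored)
    selected = [t for t, s in rescored.items() if s >= threshold]
    return sorted(selected, key=lambda t: (-rescored[t], t))


# Toy data: one cluster of two tokenised documents plus two unrelated documents.
cluster = [["topic", "extraction", "cluster", "topic"], ["topic", "cluster", "optimizer"]]
others = [["image", "segmentation", "optimizer"], ["image", "denoising", "filter"]]
topics = ensemble_topics(
    [tf_scores(cluster), tfidf_scores(cluster, cluster + others)], top_k=2
)
print(topics)
```

Because the threshold is computed from the rescored candidates themselves, the number of topics returned varies per cluster instead of being fixed in advance, which is the general idea behind a dynamic threshold; the paper's actual functions may differ.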

Acknowledgements

This work was supported by Universiti Sains Malaysia (USM) under Grant (1001/PK OMP/8014016).

Author information

Corresponding author

Correspondence to Ammar Kamal Abasi.

About this article

Cite this article

Abasi, A.K., Khader, A.T., Al-Betar, M.A. et al. A novel ensemble statistical topic extraction method for scientific publications based on optimization clustering. Multimed Tools Appl 80, 37–82 (2021). https://doi.org/10.1007/s11042-020-09504-2

  • DOI: https://doi.org/10.1007/s11042-020-09504-2
