Optimization of scientific publications clustering with ensemble approach for topic extraction

Scientometrics

Abstract

The continually growing Internet generates a considerable amount of text data. Extracting general topics or themes from a massive corpus of unstructured documents is a difficult problem. Text document clustering (TDC) is a technique for grouping texts based on their content similarity. Partitioning a text collection according to the significance of the documents’ content is one of the most challenging tasks in TDC. This study proposes the Bare-Bones Based Salp Swarm Algorithm (BBSSA) to solve the TDC problem. In addition, an ensemble approach for automatic topic extraction (TE) is proposed to extract the topics from the clusters. The proposed BBSSA and the ensemble TE approach are tested on six standard benchmarks and six scientific publication datasets from top QS-ranked UAE universities. BBSSA’s results are compared with those of sixteen well-known techniques: eleven metaheuristic algorithms, namely the Whale Optimization Algorithm (WOA), Firefly Algorithm (FFA), Bat Algorithm (BAT), Harmony Search (HS), Particle Swarm Optimization (PSO), Genetic Algorithm (GA), Multi-Verse Optimizer (MVO), Grey Wolf Optimizer (GWO), Moth-Flame Optimization (MFO), Krill Herd Algorithm (KHA), and Salp Swarm Algorithm (SSA), and five clustering methods, namely K-means++, K-means, Density-based Spatial Clustering of Applications with Noise (DBSCAN), Spectral, and Agglomerative. The results of the ensemble TE approach are compared with those of seven well-known statistical methods: Mutual Information (MI), TextRank (TR), Co-Occurrence Statistical Information-based Keyword Extraction (CSI), Term Frequency-Inverse Document Frequency (TF-IDF), most-frequent-based keyword extraction (TF), YAKE!, and RAKE. According to the experiments, BBSSA outperforms all of the compared approaches and is highly competitive. The results also reveal that, for most datasets, the proposed ensemble TE approach outperforms the existing TE methods on external metrics. Thus, the ensemble TE approach can be seen as a supplement to the other methods.

Acknowledgements

This work was supported by a grant from the Deanship of Graduate Studies and Research (DGSR) at Ajman University, Ajman, UAE (Grant No. 2021-IRG-ENIT-6).

Author information

Corresponding author

Correspondence to Mohammed Azmi Al-Betar.

Appendix

Minkowski distances:

$$\begin{aligned} D(d_1,d_2)=\displaystyle \left( \sum _{i=1}^{n}|t_{1,i} - t_{2,i}|^p \right) ^{1/p} \end{aligned}$$
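The distance above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes each document is a list of term weights of equal length \(n\), and the function name, the validity check on \(p\), and the sample vectors are hypothetical.

```python
def minkowski_distance(d1, d2, p=2.0):
    """Minkowski distance of order p between two term-weight vectors."""
    if len(d1) != len(d2):
        raise ValueError("vectors must have the same number of terms")
    if p < 1:
        raise ValueError("p must be >= 1 for the result to be a metric")
    # Sum |t_{1,i} - t_{2,i}|^p over all terms, then take the p-th root.
    return sum(abs(a - b) ** p for a, b in zip(d1, d2)) ** (1.0 / p)


# Hypothetical term-weight vectors (illustrative values only).
doc1 = [0.0, 3.0, 1.0]
doc2 = [4.0, 0.0, 1.0]

manhattan = minkowski_distance(doc1, doc2, p=1)  # p = 1: Manhattan distance
euclidean = minkowski_distance(doc1, doc2, p=2)  # p = 2: Euclidean distance
```

Setting \(p=1\) or \(p=2\) recovers the Manhattan and Euclidean distances as special cases of the same formula.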

See Tables 10, 11, 12 and 13.

Table 10 Term weight using TF-IDF
Table 11 Clusters centroids
Table 12 Example of \(a_{ki}\) where \(k=1\)
Table 13 Total average distance of the documents from the cluster centroid (ADDC)
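The table captions trace a small worked example: term weights, cluster centroids, the values \(a_{ki}\), and the resulting ADDC. The sketch below shows one way such a computation could go. Its definitions are assumptions, since this section does not state them: \(a_{ki}\) is taken as a binary indicator that document \(i\) belongs to cluster \(k\), each centroid is the mean of its member vectors, the distance is Euclidean, and ADDC is the mean over clusters of the average member-to-centroid distance. The vectors are made-up values, not the contents of Tables 10 to 13.

```python
import math


def centroid(docs, membership):
    """Mean vector of the documents whose membership flag is 1 (assumed a_ki)."""
    members = [d for d, a in zip(docs, membership) if a == 1]
    if not members:
        raise ValueError("cluster has no members")
    n_terms = len(members[0])
    return [sum(d[t] for d in members) / len(members) for t in range(n_terms)]


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def addc(docs, memberships):
    """Average, over clusters, of the mean member-to-centroid distance."""
    per_cluster = []
    for membership in memberships:
        c = centroid(docs, membership)
        dists = [euclidean(c, d) for d, a in zip(docs, membership) if a == 1]
        per_cluster.append(sum(dists) / len(dists))
    return sum(per_cluster) / len(per_cluster)


# Hypothetical term-weight vectors for four documents (illustrative only).
docs = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 6.0]]
# One row per cluster k; entry i is the assumed indicator a_ki.
memberships = [[1, 1, 0, 0], [0, 0, 1, 1]]

score = addc(docs, memberships)
```

Under these assumptions a lower ADDC means documents sit closer to their centroids, so a clustering algorithm would treat it as a value to minimize.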

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Al-Betar, M.A., Abasi, A.K., Al-Naymat, G. et al. Optimization of scientific publications clustering with ensemble approach for topic extraction. Scientometrics 128, 2819–2877 (2023). https://doi.org/10.1007/s11192-023-04674-w

  • DOI: https://doi.org/10.1007/s11192-023-04674-w
