Abstract
This article presents a validation study of the algorithm implemented in the text mining tool SOBEK, comparing it with YAKE!, a well-known unsupervised keyword extraction algorithm. Both algorithms identify keywords in single documents using mainly statistical methods, providing context-independent information. The article briefly reviews previous uses of SOBEK in the literature and presents a detailed description of its text mining algorithm. In the validation study, both systems were used to extract keywords from texts belonging to fourteen public text databases, each containing several documents. In general, their performance was found to be equivalent: each algorithm outperformed the other in some of the tests, and the two reached similar results in the others. Understanding why each algorithm outperformed the other in different circumstances may shed light on the advantages and disadvantages of specific features of keyword extraction methods.
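As an illustration of how a comparison of this kind can be scored, the sketch below computes precision, recall and F1 for one extractor's output against a document's reference keywords, and averages F1 over a dataset. The abstract does not state which metric the study used, so the choice of F1, the exact-match rule after lowercasing, and the function names `normalize`, `precision_recall_f1` and `mean_f1` are all assumptions made for this example, not the authors' evaluation code.

```python
from typing import Iterable, List, Tuple


def normalize(keyword: str) -> str:
    """Lowercase and collapse whitespace (assumed matching rule: exact match on this form)."""
    return " ".join(keyword.lower().split())


def precision_recall_f1(predicted: Iterable[str], gold: Iterable[str]) -> Tuple[float, float, float]:
    """Score one document: extracted keywords versus reference (gold) keywords."""
    pred = {normalize(k) for k in predicted}
    ref = {normalize(k) for k in gold}
    if not pred or not ref:
        return 0.0, 0.0, 0.0
    hits = len(pred & ref)  # keywords found in both sets
    if hits == 0:
        return 0.0, 0.0, 0.0
    precision = hits / len(pred)
    recall = hits / len(ref)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def mean_f1(predictions: List[List[str]], golds: List[List[str]]) -> float:
    """Average the per-document F1 over one dataset (macro average)."""
    scores = [precision_recall_f1(p, g)[2] for p, g in zip(predictions, golds)]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Hypothetical output of one extractor and hypothetical gold keywords.
    extracted = ["Text Mining", "keyword", "graph"]
    reference = ["text mining", "keyword extraction"]
    print(precision_recall_f1(extracted, reference))
```

Running the same scoring over the output of two extractors on each dataset yields one mean score per extractor per dataset, which is the kind of table from which "outperformed" or "similar results" can be read.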
Copyright information
© 2022 IFIP International Federation for Information Processing
About this paper
Cite this paper
Reategui, E., Bigolin, M., Carniato, M., dos Santos, R.A. (2022). Evaluating the Performance of SOBEK Text Mining Keyword Extraction Algorithm. In: Holzinger, A., Kieseberg, P., Tjoa, A.M., Weippl, E. (eds) Machine Learning and Knowledge Extraction. CD-MAKE 2022. Lecture Notes in Computer Science, vol 13480. Springer, Cham. https://doi.org/10.1007/978-3-031-14463-9_15
DOI: https://doi.org/10.1007/978-3-031-14463-9_15
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-14462-2
Online ISBN: 978-3-031-14463-9
eBook Packages: Computer Science, Computer Science (R0)