Abstract
Due to the widespread use of the Internet, users can easily access collections of university academic documents stored in virtual libraries, whose information is unstructured. In recent years, the production and publication of scientific documents in Ecuador have increased considerably, so searching and classifying documents is a fundamental task in information retrieval systems. Intelligent search systems allow information to be found with a high degree of accuracy and similarity. For this project, academic documents from the Ecuador Network of Open Access Repositories (RRAAE) were retrieved using a glossary of terms in the area of science and technology. The documents were retrieved by web scraping, and the results were stored in a cloud database in JSON format. NLP techniques were applied to the retrieved documents to clean and homogenize the unstructured information. Two similarity metrics were used to measure the divergence between the retrieved documents, and similarity matrices were generated based on the title, keywords, and abstract, which were then unified into a weighted matrix. The results are displayed in a web interface that uses graphs to show the relationships between linked documents. The similarity system was validated through functional tests using a collection of 30 queries, with indexed and non-indexed terms as input to the information retrieval system. The experiments showed that the system performs better for indexed terms.
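The step that unifies per-field similarity matrices into one weighted matrix can be illustrated with a minimal sketch. The abstract does not name the two similarity metrics, the cleaning steps, or the weights, so everything specific here is an assumption made for illustration: cosine similarity over term counts stands in for the metrics, and the stop-word list, the field weights, and the sample records are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical stop words and field weights; the abstract does not specify them.
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "of", "on", "the", "to", "with"}
FIELD_WEIGHTS = {"title": 0.3, "keywords": 0.3, "abstract": 0.4}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between the term-count vectors of two token lists."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[t] * counts_b[t] for t in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty field is treated as sharing nothing
    return dot / (norm_a * norm_b)


def field_matrix(docs, field):
    """Pairwise similarity matrix for one field (title, keywords, or abstract)."""
    tokens = [tokenize(doc[field]) for doc in docs]
    n = len(docs)
    return [[cosine_similarity(tokens[i], tokens[j]) for j in range(n)]
            for i in range(n)]


def weighted_matrix(docs, weights=FIELD_WEIGHTS):
    """Combine the per-field matrices into one matrix as a weighted sum."""
    n = len(docs)
    combined = [[0.0] * n for _ in range(n)]
    for field, weight in weights.items():
        matrix_for_field = field_matrix(docs, field)
        for i in range(n):
            for j in range(n):
                combined[i][j] += weight * matrix_for_field[i][j]
    return combined


# Made-up records in the shape a scraper might store as JSON.
docs = [
    {"title": "Text mining of academic documents",
     "keywords": "text mining, NLP",
     "abstract": "We apply text mining to academic documents."},
    {"title": "Mining academic text with NLP",
     "keywords": "NLP, text mining",
     "abstract": "Academic documents are processed with NLP."},
    {"title": "Soil erosion in the Andes",
     "keywords": "erosion, soil",
     "abstract": "A field study of soil erosion."},
]

matrix = weighted_matrix(docs)
for row in matrix:
    print([round(value, 3) for value in row])
```

Because the weights sum to 1, each entry of the combined matrix stays in the range 0 to 1, which makes it convenient to use as an edge weight when drawing a document graph. A second metric could be added by computing another set of field matrices and including them in the same weighted sum.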
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Vallejo-Huanga, D., Jaime, J., Andrade, C. (2023). Similarity Visualizer Using Natural Language Processing in Academic Documents of the DSpace in Ecuador. In: Sserwanga, I., et al. Information for a Better World: Normality, Virtuality, Physicality, Inclusivity. iConference 2023. Lecture Notes in Computer Science, vol 13972. Springer, Cham. https://doi.org/10.1007/978-3-031-28032-0_28
DOI: https://doi.org/10.1007/978-3-031-28032-0_28
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-28031-3
Online ISBN: 978-3-031-28032-0
eBook Packages: Computer Science, Computer Science (R0)