Similarity Visualizer Using Natural Language Processing in Academic Documents of the DSpace in Ecuador

Vallejo-Huanga, Diego; Jaime, Janneth; Andrade, Carlos

doi:10.1007/978-3-031-28032-0_28

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13972)

Included in the following conference series:

International Conference on Information

Abstract

Due to the widespread use of the Internet, users can easily access collections of university academic documents stored in virtual libraries, whose information is unstructured. In recent years, the production and publication of scientific documents in Ecuador have increased considerably, so searching and classifying documents is a fundamental task in information retrieval systems. Intelligent search systems allow information to be found with a high degree of accuracy and similarity. For this project, academic documents from the Ecuador Network of Open Access Repositories (RRAAE) were retrieved using a glossary of terms in the area of science and technology. The documents were retrieved by web scraping, and the results were stored in a cloud database in JSON format. NLP techniques were applied to the retrieved documents to clean and homogenize the unstructured information. Two similarity metrics were used to measure the divergence between the retrieved documents, and similarity matrices were generated based on the title, keywords, and abstract, which were then unified into a weighted matrix. The results of the system are displayed in a web interface that uses graphs to show the relationships between linked documents. The similarity system was validated through functional tests, using a collection of 30 queries with indexed and non-indexed terms as input to the information retrieval system. The experiments showed that the system performs better for indexed terms.
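
The per-field similarity matrices and their unification into a weighted matrix, as described in the abstract, can be illustrated with a minimal sketch. The abstract does not name the two similarity metrics or the weights, so everything specific here is an assumption: cosine similarity over raw term counts stands in for the metric, the field weights are hypothetical, the three sample documents are invented, and the cleaning step is reduced to lowercasing and tokenizing.

```python
import math
import re
from collections import Counter

# Hypothetical field weights; the abstract does not state the values used.
FIELD_WEIGHTS = {"title": 0.3, "keywords": 0.3, "abstract": 0.4}

# Invented sample records, shaped like scraped JSON documents.
DOCS = [
    {"title": "Document similarity with NLP",
     "keywords": "nlp, similarity",
     "abstract": "Measuring similarity between academic documents using text processing"},
    {"title": "Text similarity in digital repositories",
     "keywords": "similarity, repositories",
     "abstract": "Similarity of documents stored in academic repositories"},
    {"title": "Soil erosion in the Andes",
     "keywords": "soil, erosion",
     "abstract": "Field study of erosion rates on mountain slopes"},
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two term-frequency vectors (assumed metric)."""
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(freq_a[t] * freq_b[t] for t in freq_a.keys() & freq_b.keys())
    norm_a = math.sqrt(sum(v * v for v in freq_a.values()))
    norm_b = math.sqrt(sum(v * v for v in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def field_matrix(docs, field):
    """Pairwise similarity matrix computed from a single field."""
    tokens = [tokenize(doc[field]) for doc in docs]
    n = len(docs)
    return [[cosine_similarity(tokens[i], tokens[j]) for j in range(n)]
            for i in range(n)]


def weighted_matrix(docs, weights=FIELD_WEIGHTS):
    """Unify the per-field matrices into one matrix as a weighted sum."""
    n = len(docs)
    combined = [[0.0] * n for _ in range(n)]
    for field, weight in weights.items():
        matrix = field_matrix(docs, field)
        for i in range(n):
            for j in range(n):
                combined[i][j] += weight * matrix[i][j]
    return combined


if __name__ == "__main__":
    for row in weighted_matrix(DOCS):
        print([round(value, 3) for value in row])
```

Because the weights sum to 1 and each field similarity lies in [0, 1], the combined score stays in [0, 1], which makes it usable as an edge weight when documents are drawn as a graph.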

Author information

Authors and Affiliations

Universidad Politécnica Salesiana, IDEIAGEOCA Research Group, Quito, Ecuador
Diego Vallejo-Huanga
Universidad Politécnica Salesiana, Department of Computer Science, Quito, Ecuador
Janneth Jaime & Carlos Andrade

Corresponding author

Correspondence to Diego Vallejo-Huanga.

Editor information

Editors and Affiliations

iSchool Organization, Berlin, Germany
Isaac Sserwanga
Victoria University of Wellington, Wellington, New Zealand
Anne Goulding
University of Missouri, Chicago, IL, USA
Heather Moulaison-Sandy
University of South Australia, Adelaide, SA, Australia
Jia Tina Du
University of Porto, Porto, Portugal
António Lucas Soares
Monash University, Clayton, VIC, Australia
Viviane Hessami
University of Tennessee at Knoxville, Knoxville, TN, USA
Rebecca D. Frank

About this paper

Cite this paper

Vallejo-Huanga, D., Jaime, J., Andrade, C. (2023). Similarity Visualizer Using Natural Language Processing in Academic Documents of the DSpace in Ecuador. In: Sserwanga, I., et al. Information for a Better World: Normality, Virtuality, Physicality, Inclusivity. iConference 2023. Lecture Notes in Computer Science, vol 13972. Springer, Cham. https://doi.org/10.1007/978-3-031-28032-0_28

DOI: https://doi.org/10.1007/978-3-031-28032-0_28
Published: 10 March 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-28031-3
Online ISBN: 978-3-031-28032-0
eBook Packages: Computer Science, Computer Science (R0)
