Similarity Visualizer Using Natural Language Processing in Academic Documents of the DSpace in Ecuador

  • Conference paper
Information for a Better World: Normality, Virtuality, Physicality, Inclusivity (iConference 2023)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 13972))


Abstract

Due to the widespread use of the Internet, users can easily access collections of university academic documents stored in virtual libraries, whose information is unstructured. In recent years, the production and publication of scientific documents in Ecuador have increased considerably, so searching and classifying documents is a fundamental task in information retrieval systems. Intelligent search systems allow information to be found with a high degree of accuracy and similarity. For this project, academic documents from the Ecuador Network of Open Access Repositories (RRAAE) were retrieved using a glossary of terms in the area of science and technology. The documents were retrieved by web scraping, and the results were stored in a cloud database in JSON format. NLP techniques were then applied to the retrieved documents to clean and homogenize the unstructured information. Two similarity metrics were used to measure the divergence between the retrieved documents, and similarity matrices were generated based on the title, keywords, and abstract, which were then unified into a weighted matrix. The results are displayed in a web interface that uses graphs to show the relationships between linked documents. The similarity system was validated through functional tests, experimenting with a collection of 30 queries containing indexed and non-indexed terms as input to the information retrieval system. The experiments showed that the system performs better for indexed terms.
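The weighted-matrix step described in the abstract can be illustrated with a minimal sketch. The abstract does not name the two similarity metrics, the field weights, or the cleaning rules, so everything below is an assumption: cosine similarity over raw term counts stands in for the metrics, and the stop-word list, the `WEIGHTS` values, the edge threshold, and the sample records are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical stop words and field weights; the abstract gives neither.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "the", "to", "using"}
WEIGHTS = {"title": 0.3, "keywords": 0.3, "abstract": 0.4}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def cosine(a, b):
    """Cosine similarity of two token lists (assumed stand-in metric)."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def field_matrix(docs, field):
    """Pairwise similarity matrix for one field (title, keywords, or abstract)."""
    tokens = [tokenize(d[field]) for d in docs]
    n = len(docs)
    return [[cosine(tokens[i], tokens[j]) for j in range(n)] for i in range(n)]


def weighted_matrix(docs, weights=WEIGHTS):
    """Combine the per-field matrices into one matrix as a weighted sum."""
    n = len(docs)
    combined = [[0.0] * n for _ in range(n)]
    for field, w in weights.items():
        m = field_matrix(docs, field)
        for i in range(n):
            for j in range(n):
                combined[i][j] += w * m[i][j]
    return combined


def edges(matrix, threshold=0.2):
    """Document pairs similar enough to be linked in a graph view."""
    n = len(matrix)
    return [
        (i, j, round(matrix[i][j], 3))
        for i in range(n)
        for j in range(i + 1, n)
        if matrix[i][j] >= threshold
    ]


# Made-up records shaped like scraped JSON entries.
docs = [
    {"title": "Document clustering with NLP",
     "keywords": "clustering, NLP",
     "abstract": "We cluster academic documents using text similarity."},
    {"title": "Text similarity for academic documents",
     "keywords": "similarity, NLP",
     "abstract": "Text similarity measures compare academic documents."},
    {"title": "Soil chemistry survey",
     "keywords": "soil, chemistry",
     "abstract": "A field survey of soil samples."},
]

if __name__ == "__main__":
    for edge in edges(weighted_matrix(docs)):
        print(edge)
```

In this sketch the edge list plays the role of the graph shown in the web interface: each tuple links two document indices whose combined similarity meets the threshold. Swapping in a different metric only requires replacing `cosine`.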

Corresponding author

Correspondence to Diego Vallejo-Huanga.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Vallejo-Huanga, D., Jaime, J., Andrade, C. (2023). Similarity Visualizer Using Natural Language Processing in Academic Documents of the DSpace in Ecuador. In: Sserwanga, I., et al. Information for a Better World: Normality, Virtuality, Physicality, Inclusivity. iConference 2023. Lecture Notes in Computer Science, vol 13972. Springer, Cham. https://doi.org/10.1007/978-3-031-28032-0_28

  • DOI: https://doi.org/10.1007/978-3-031-28032-0_28

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-28031-3

  • Online ISBN: 978-3-031-28032-0

  • eBook Packages: Computer Science (R0)
