Towards the Creation of AI-powered Queries Using Transfer Learning on NLP Model - The THESPIAN-NER Experience

  • Conference paper
Image Analysis and Processing. ICIAP 2022 Workshops (ICIAP 2022)

Abstract

Tools for HEritage Science Processing, Integration, and ANalysis (THESPIAN) is a cloud system that offers multiple web services to the researchers of INFN-CHNet, from storing their raw data to reusing them by following the FAIR principles for establishing integration and interoperability among shared information.

Data and metadata (the latter modelled on a CIDOC-based ontology called CRMhs [20]) are injected into the CHNet cloud database through the cloud service THESPIAN-Mask.

THESPIAN-NER is a tool based on a deep neural network for Named Entity Recognition (NER). It is meant to ease data extraction from the database: users upload .pdf or .txt files and obtain named entities and keywords that can then be searched for in the metadata entries of the database.

The neural network on which THESPIAN-NER relies is built on a set of open-source NLP models; transfer learning was employed to customise the Named Entity Recognition output of the models so that it matches the CRMhs ontology properties.

The service is now available in alpha version to researchers on the CHNet cloud.

Notes

  1. For some references about the usage of NER in Cultural Heritage applications, see [8, 13, 15, 19] and references therein.

  2. Using the open-source add-on library displaCy, it is possible to render the annotated text as HTML, with each named entity highlighted in a HEX colour. The web application uses these colours as a visual aid, so that users can tell entities with different NER labels apart.

  3. We recall that

    $$\begin{aligned} \mathrm{P} = \frac{TP}{TP+FP} \,, \quad \mathrm{R} = \frac{TP}{TP+FN} \,, \quad \mathrm{F} = 2 \, \frac{\mathrm{P} \cdot \mathrm{R}}{\mathrm{P} + \mathrm{R}} \,, \end{aligned}$$
    (1)

    where P is the precision, R is the recall, F is the F-score, TP is the number of true positives, FP is the number of false positives, and FN is the number of false negatives.

  4. The .pdf files are parsed with PyPDF2.
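The definitions in Eq. (1) of Note 3 can be illustrated with a minimal Python sketch. The `NerCounts` container, the function name, and the convention of returning 0.0 when a denominator vanishes are assumptions made for this illustration; they are not taken from the THESPIAN-NER code base, and the counts in the usage example are made up.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class NerCounts:
    """Counts for one NER label (hypothetical container, not from the paper)."""

    tp: int  # true positives
    fp: int  # false positives
    fn: int  # false negatives


def precision_recall_f(counts: NerCounts) -> tuple[float, float, float]:
    """Return (P, R, F) following Eq. (1).

    Assumption: a ratio with a zero denominator is reported as 0.0.
    """
    p_den = counts.tp + counts.fp
    r_den = counts.tp + counts.fn
    p = counts.tp / p_den if p_den else 0.0
    r = counts.tp / r_den if r_den else 0.0
    f = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f


if __name__ == "__main__":
    # Illustrative counts only; these are not results reported by the authors.
    p, r, f = precision_recall_f(NerCounts(tp=3, fp=1, fn=3))
    print(f"P={p:.2f} R={r:.2f} F={f:.2f}")
```

F is the harmonic mean of P and R, so it stays low unless both precision and recall are high.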

References

  1. Bekiari, C., Bruseker, G., Doerr, M., Ore, C.E., Stead, S., Velios, A.: CIDOC CRM. International Committee for Documentation (CIDOC) of the International Council of Museums (ICOM). Version 7.1.1. https://doi.org/10.26225/FDZH-X261

  2. Bombini, A., et al.: CHNet cloud: an EOSC-based cloud for physical technologies applied to cultural heritages. In: GARR (ed.) Conferenza GARR 2021 - Sostenibile/Digitale. Dati e tecnologie per il futuro, vol. selected papers. Associazione Consortium GARR (2021). https://doi.org/10.26314/GARR-Conf21-proceedings-09

  3. Bosco, C., Lenci, A., Montemagni, S., Simi, M.: Universal dependencies 2.9 - Italian corpus, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University (2021). http://hdl.handle.net/11234/1-4611

  4. Bosco, C., Montemagni, S., Simi, M.: Converting Italian treebanks: Towards an Italian Stanford dependency treebank. In: Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, Sofia, Bulgaria, pp. 61–69. Association for Computational Linguistics, Aug 2013. https://aclanthology.org/W13-2308

  5. Castelli, L., Felicetti, A., Proietti, F.: Heritage science and cultural heritage: standards and tools for establishing cross-domain data interoperability (2019). https://doi.org/10.1007/s00799-019-00275-2

  6. Ceccanti, A., Vianello, E., Caberletti, M., Giacomini, F.: Beyond X.509: token-based authentication and authorization for HEP. EPJ Web Conf. 214, 09002 (2019). https://doi.org/10.1051/epjconf/201921409002

  7. Chiari, M., et al.: LABEC, the INFN ion beam laboratory of nuclear techniques for environment and cultural heritage. Eur. Phys. J. Plus 136(4), 472 (2021). https://doi.org/10.1140/epjp/s13360-021-01411-1

  8. van Dalen-Oskam, K., et al.: Named entity recognition and resolution for literary studies. In: CLIN 2014 (2014)

  9. DataCloud-Collaboration: INDIGO-DataCloud: A data and computing platform to facilitate seamless access to e-infrastructures

  10. Honnibal, M., Johnson, M.: An improved non-monotonic transition system for dependency parsing. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, pp. 1373–1378. Association for Computational Linguistics, Sept 2015. https://aclweb.org/anthology/D/D15/D15-1162

  11. Honnibal, M., et al.: Explosion/spaCy: v2.1.7: Improved evaluation, better language factories and bug fixes, Aug 2019. https://doi.org/10.5281/zenodo.3358113

  12. Honnibal, M., Montani, I., Van Landeghem, S., Boyd, A.: spaCy: industrial-strength Natural Language Processing in Python (2020). https://doi.org/10.5281/zenodo.1212303

  13. van Hooland, S., De Wilde, M., Verborgh, R., Steiner, T., Van de Walle, R.: Exploring entity recognition and disambiguation for cultural heritage collections. Digital Sch. Humanit. 30(2), 262–279 (2013). https://doi.org/10.1093/llc/fqt067

  14. (IETF): The OAuth 2.0 Authorization Framework (2012). https://datatracker.ietf.org/doc/html/rfc6749

  15. Jain, N., Krestel, R.: Who is Mona L.? Identifying mentions of artworks in historical archives. In: Doucet, A., Isaac, A., Golub, K., Aalberg, T., Jatowt, A. (eds.) TPDL 2019. LNCS, vol. 11799, pp. 115–122. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-30760-8_10

  16. Klie, J.C., Bugert, M., Boullosa, B., de Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9. Association for Computational Linguistics, June 2018. http://tubiblio.ulb.tu-darmstadt.de/106270/

  17. Mesnil, G., et al.: Unsupervised and transfer learning challenge: a deep learning approach. In: Guyon, I., Dror, G., Lemaire, V., Taylor, G., Silver, D. (eds.) Proceedings of ICML Workshop on Unsupervised and Transfer Learning. Proceedings of Machine Learning Research, Bellevue, Washington, USA, vol. 27, pp. 97–110. PMLR, 02 July 2012. https://proceedings.mlr.press/v27/mesnil12a.html

  18. Montani, I., et al.: explosion/spaCy: v3.1.4: Python 3.10 wheels and support for AppleOps, Oct 2021. https://doi.org/10.5281/zenodo.5617894

  19. Mosallam, Y., Abi-Haidar, A., Ganascia, J.-G.: Unsupervised named entity recognition and disambiguation: an application to old French journals. In: Perner, P. (ed.) ICDM 2014. LNCS (LNAI), vol. 8557, pp. 12–23. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-08976-8_2

  20. Niccolucci, F., Felicetti, A.: A CIDOC CRM-based model for the documentation of heritage sciences. In: 2018 3rd Digital Heritage International Congress (DigitalHERITAGE) held jointly with 2018 24th International Conference on Virtual Systems Multimedia (VSMM 2018), pp. 1–6 (2018). https://doi.org/10.1109/DigitalHeritage.2018.8810109

  21. Wilkinson, M.D., Dumontier, M., Aalbersberg, I.J., et al.: The FAIR guiding principles for scientific data management and stewardship. Sci. Data 3, 160018 (2016). https://doi.org/10.1038/sdata.2016.1

Acknowledgments

The present research has been partially funded by the European Commission within the Framework Programme Horizon 2020 with the projects ARIADNEplus (GA no. H2020-INFRAIA-01-2018-2019-823914) and EOSC-Pillar (GA no. H2020-INFRAEOSC-05-2018-2019-857650).

Author information

Corresponding author

Correspondence to Alessandro Bombini.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Bombini, A., Castelli, L., Felicetti, A., Niccolucci, F., Reccia, A., Taccetti, F. (2022). Towards the Creation of AI-powered Queries Using Transfer Learning on NLP Model - The THESPIAN-NER Experience. In: Mazzeo, P.L., Frontoni, E., Sclaroff, S., Distante, C. (eds) Image Analysis and Processing. ICIAP 2022 Workshops. ICIAP 2022. Lecture Notes in Computer Science, vol 13374. Springer, Cham. https://doi.org/10.1007/978-3-031-13324-4_23

  • DOI: https://doi.org/10.1007/978-3-031-13324-4_23

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-13323-7

  • Online ISBN: 978-3-031-13324-4

  • eBook Packages: Computer Science, Computer Science (R0)
