Towards the Creation of AI-powered Queries Using Transfer Learning on NLP Model - The THESPIAN-NER Experience

Bombini, Alessandro; Castelli, Lisa; Felicetti, Achille; Niccolucci, Franco; Reccia, Anna; Taccetti, Francesco

doi:10.1007/978-3-031-13324-4_23

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13374)

Included in the following conference series:

International Conference on Image Analysis and Processing

Abstract

Tools for HEritage Science Processing, Integration, and ANalysis (THESPIAN) is a cloud system that offers multiple web services to the researchers of INFN-CHNet, from storing their raw data to reusing them by following the FAIR principles for establishing integration and interoperability among shared information.

Data and metadata (the latter modelled on a CIDOC-based ontology called CRMhs [20]) are ingested into the CHNet cloud database through the cloud service THESPIAN-Mask.

THESPIAN-NER is a tool based on a deep neural network for Named Entity Recognition (NER). It is intended to ease data extraction from the database: users upload .pdf or .txt files and obtain named entities and keywords that can then be looked up in the metadata entries of the database.

The neural network, on which THESPIAN-NER relies, is based on a set of open-source NLP models; transfer learning was employed to customise the Named Entity Recognition output of the models to match the CRMhs ontology properties.

The service is now available in alpha version to researchers on the CHNet cloud.
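As a rough illustration of the workflow described in the abstract, the sketch below loads an uploaded .txt or .pdf file and groups extracted entities by NER label so they can serve as search keywords. It is not the THESPIAN-NER implementation: the function names `read_document` and `entities_to_query_terms`, the entity labels, and the sample entities are all hypothetical. The PDF branch assumes PyPDF2 version 2.0 or later, PyPDF2 being the parser named in the chapter's notes.

```python
from collections import defaultdict
from pathlib import Path


def read_document(path):
    """Return the plain text of a .txt or .pdf file (hypothetical helper)."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".txt":
        return path.read_text(encoding="utf-8")
    if suffix == ".pdf":
        # Imported lazily so .txt input works without PyPDF2 installed.
        # Assumes PyPDF2 >= 2.0, which provides PdfReader.
        from PyPDF2 import PdfReader

        reader = PdfReader(str(path))
        # extract_text() may return None for pages without a text layer.
        return "\n".join(page.extract_text() or "" for page in reader.pages)
    raise ValueError(f"Unsupported file type: {suffix!r}")


def entities_to_query_terms(entities):
    """Group (text, label) pairs by label, dropping blanks and duplicates.

    The result maps each NER label to an ordered list of keywords that a
    metadata search could use. The labels are whatever the NER model emits.
    """
    terms = defaultdict(list)
    for text, label in entities:
        keyword = text.strip()
        if keyword and keyword not in terms[label]:
            terms[label].append(keyword)
    return dict(terms)


if __name__ == "__main__":
    # Hand-written sample entities; a real run would obtain them from an
    # NER model, e.g. [(ent.text, ent.label_) for ent in nlp(text).ents]
    # with a spaCy pipeline. The labels below are invented for illustration.
    sample = [("XRF", "TECHNIQUE"), ("Firenze", "PLACE"), ("XRF", "TECHNIQUE")]
    print(entities_to_query_terms(sample))
```

Grouping by label keeps each keyword tied to the ontology property it was recognised as, so a search can be restricted to the matching metadata field rather than run over free text.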

Notes

1. For some references about the usage of NER in Cultural Heritage applications, see [8, 13, 15, 19] and references therein.
2. Using the open-source add-on library displacy, it is possible to render the annotated text as HTML, with each named entity highlighted in a hex colour. The web application uses these colours as a visual aid, so that users can tell entities with different NER labels apart.
3. We recall that
$$\mathrm{P} = \frac{TP}{TP+FP}, \quad \mathrm{R} = \frac{TP}{TP+FN}, \quad \mathrm{F} = 2\,\frac{\mathrm{P} \cdot \mathrm{R}}{\mathrm{P} + \mathrm{R}}, \tag{1}$$
where TP is the number of true positives, FP the number of false positives, and FN the number of false negatives.
4. The .pdf parsing uses PyPDF2.
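
The precision, recall, and F-score definitions recalled in note 3 can be computed directly from the three counts. The following is a minimal sketch. The function name `precision_recall_f1` and the example counts are made up for illustration, and returning 0.0 when a denominator is zero is an assumed convention, not something the chapter specifies.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Compute P, R and F from true-positive, false-positive and
    false-negative counts, following Eq. (1)."""
    # Assumed convention: an undefined ratio (zero denominator) is reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_score


if __name__ == "__main__":
    # Illustrative counts only; these are not results from the chapter.
    p, r, f = precision_recall_f1(tp=8, fp=2, fn=4)
    print(f"P={p:.3f} R={r:.3f} F={f:.3f}")  # prints "P=0.800 R=0.667 F=0.727"
```

For NER evaluation these counts are usually accumulated per label, so the same function can be applied once per entity type and once to the pooled totals.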

References

1. Bekiari, C., Bruseker, G., Doerr, M., Ore, C.E., Stead, S., Velios, A.: CIDOC CRM. International Committee for Documentation (CIDOC) of the International Council of Museums (ICOM). Version 7.1.1. https://doi.org/10.26225/FDZH-X261
2. Bombini, A., et al.: CHNet cloud: an EOSC-based cloud for physical technologies applied to cultural heritages. In: GARR (ed.) Conferenza GARR 2021 - Sostenibile/Digitale. Dati e tecnologie per il futuro, vol. selected papers. Associazione Consortium GARR (2021). https://doi.org/10.26314/GARR-Conf21-proceedings-09
3. Bosco, C., Lenci, A., Montemagni, S., Simi, M.: Universal dependencies 2.9 - italian corpus, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University (2021). http://hdl.handle.net/11234/1-4611
4. Bosco, C., Montemagni, S., Simi, M.: Converting Italian treebanks: Towards an Italian Stanford dependency treebank. In: Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, Sofia, Bulgaria, pp. 61–69. Association for Computational Linguistics, Aug 2013. https://aclanthology.org/W13-2308
5. Castelli, L., Felicetti, A., Proietti, F.: Heritage science and cultural heritage: standards and tools for establishing cross-domain data interoperability (2019). https://doi.org/10.1007/s00799-019-00275-2
6. Ceccanti, A., Vianello, E., Caberletti, M., Giacomini, F.: Beyond x.509: token-based authentication and authorization for hep. EPJ Web Conf. 214, 09002 (2019). https://doi.org/10.1051/epjconf/201921409002
7. Chiari, M., et al.: LABEC, the INFN ion beam laboratory of nuclear techniques for environment and cultural heritage. Eur. Phys. J. Plus 136(4), 472 (2021). https://doi.org/10.1140/epjp/s13360-021-01411-1
8. van Dalen-Oskam, K., et al.: Named entity recognition and resolution for literary studies. In: CLIN 2014 (2014)
9. DataCloud-Collaboration: INDIGO-DataCloud: A data and computing platform to facilitate seamless access to e-infrastructures
10. Honnibal, M., Johnson, M.: An improved non-monotonic transition system for dependency parsing. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, pp. 1373–1378. Association for Computational Linguistics, Sept 2015. https://aclweb.org/anthology/D/D15/D15-1162
11. Honnibal, M., et al.: Explosion/spaCy: v2.1.7: Improved evaluation, better language factories and bug fixes, Aug 2019. https://doi.org/10.5281/zenodo.3358113
12. Honnibal, M., Montani, I., Van Landeghem, S., Boyd, A.: spaCy: industrial-strength Natural Language Processing in Python (2020). https://doi.org/10.5281/zenodo.1212303
13. van Hooland, S., De Wilde, M., Verborgh, R., Steiner, T., Van de Walle, R.: Exploring entity recognition and disambiguation for cultural heritage collections. Digital Sch. Humanit. 30(2), 262–279 (2013). https://doi.org/10.1093/llc/fqt067
14. (IETF): The OAuth 2.0 Authorization Framework (2012). https://datatracker.ietf.org/doc/html/rfc6749
15. Jain, N., Krestel, R.: Who is mona L.? identifying mentions of artworks in historical archives. In: Doucet, A., Isaac, A., Golub, K., Aalberg, T., Jatowt, A. (eds.) TPDL 2019. LNCS, vol. 11799, pp. 115–122. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-30760-8_10
16. Klie, J.C., Bugert, M., Boullosa, B., de Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9. Association for Computational Linguistics, June 2018. http://tubiblio.ulb.tu-darmstadt.de/106270/
17. Mesnil, G., et al.: Unsupervised and transfer learning challenge: a deep learning approach. In: Guyon, I., Dror, G., Lemaire, V., Taylor, G., Silver, D. (eds.) Proceedings of ICML Workshop on Unsupervised and Transfer Learning. Proceedings of Machine Learning Research, Bellevue, Washington, USA, vol. 27, pp. 97–110. PMLR, 02 July 2012. https://proceedings.mlr.press/v27/mesnil12a.html
18. Montani, I., et al.: explosion/spaCy: v3.1.4: Python 3.10 wheels and support for AppleOps, Oct 2021. https://doi.org/10.5281/zenodo.5617894
19. Mosallam, Y., Abi-Haidar, A., Ganascia, J.-G.: Unsupervised named entity recognition and disambiguation: an application to old French journals. In: Perner, P. (ed.) ICDM 2014. LNCS (LNAI), vol. 8557, pp. 12–23. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-08976-8_2
20. Niccolucci, F., Felicetti, A.: A cidoc CRM-based model for the documentation of heritage sciences. In: 2018 3rd Digital Heritage International Congress (DigitalHERITAGE) held jointly with 2018 24th International Conference on Virtual Systems Multimedia (VSMM 2018), pp. 1–6 (2018). https://doi.org/10.1109/DigitalHeritage.2018.8810109
21. Wilkinson, M.D., Dumontier, M., Aalbersberg, I.J., et al.: The fair guiding principles for scientific data management and stewardship. Sci. Data 3, 160018 (2016). https://doi.org/10.1038/sdata.2016.1

Acknowledgments

The present research has been partially funded by the European Commission within the Framework Programme Horizon 2020 with the projects ARIADNEplus (GA no. H2020-INFRAIA-01-2018-2019-823914) and EOSC-Pillar (GA no. H2020-INFRAEOSC-05-2018-2019-857650).

Author information

Authors and Affiliations

INFN Florence Section, Via Bruno Rossi 1, Florence, Italy
Alessandro Bombini, Lisa Castelli & Francesco Taccetti
PIN, Prato, Italy
Achille Felicetti, Franco Niccolucci & Anna Reccia

Corresponding author

Correspondence to Alessandro Bombini.

Editor information

Editors and Affiliations

National Research Council, Lecce, Italy
Pier Luigi Mazzeo
Università Politecnica delle Marche, Ancona, Italy
Emanuele Frontoni
Boston University, Boston, MA, USA
Stan Sclaroff
National Research Council, Lecce, Italy
Cosimo Distante

About this paper

Cite this paper

Bombini, A., Castelli, L., Felicetti, A., Niccolucci, F., Reccia, A., Taccetti, F. (2022). Towards the Creation of AI-powered Queries Using Transfer Learning on NLP Model - The THESPIAN-NER Experience. In: Mazzeo, P.L., Frontoni, E., Sclaroff, S., Distante, C. (eds) Image Analysis and Processing. ICIAP 2022 Workshops. ICIAP 2022. Lecture Notes in Computer Science, vol 13374. Springer, Cham. https://doi.org/10.1007/978-3-031-13324-4_23

DOI: https://doi.org/10.1007/978-3-031-13324-4_23
Published: 04 August 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-13323-7
Online ISBN: 978-3-031-13324-4
eBook Packages: Computer Science, Computer Science (R0)
