Abstract
Tools for HEritage Science Processing, Integration, and ANalysis (THESPIAN) is a cloud system that offers INFN-CHNet researchers multiple web services, from storing raw data to reusing them in accordance with the FAIR principles, so that shared information can be integrated and made interoperable.
Data and metadata (the latter modelled on a CIDOC-based ontology called CRMhs [20]) are ingested into the CHNet cloud database through the cloud service THESPIAN-Mask.
THESPIAN-NER is a tool based on a deep neural network for Named Entity Recognition (NER). It will ease data extraction from the database: users upload .pdf or .txt files and obtain named entities and keywords, which are then searched for in the metadata entries of the database.
The neural network on which THESPIAN-NER relies is based on a set of open-source NLP models; transfer learning was employed to customise the models' NER output to match the properties of the CRMhs ontology.
The service is now available in alpha version to researchers on the CHNet cloud.
Notes
- 1.
- 2. Using the open-source add-on library displacy, it is possible to render the annotated text as HTML, with each named entity highlighted in a HEX colour. The web application uses this colour as a visual aid, so that users can tell apart entities with different NER labels.
- 3. We recall that
$$\begin{aligned} \mathrm {P} = \frac{TP}{TP+FP} \,, \quad \mathrm {R} = \frac{TP}{TP+FN} \,, \quad \mathrm {F} = 2 \, \frac{\mathrm {P} \cdot \mathrm {R}}{\mathrm {P} + \mathrm {R}} \,, \end{aligned} \tag{1}$$
where TP are the true positive counts, FP are the false positive counts, and FN are the false negative counts.
- 4. PyPDF2 is used for the .pdf parsing.
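The precision, recall and F-score definitions recalled in note 3 can be illustrated with a minimal Python sketch. It assumes that entities are compared as exact `(start, end, label)` matches, which the notes above do not specify; the function name `ner_scores`, the `Scores` tuple and the example labels are hypothetical and are not taken from THESPIAN-NER or the CRMhs ontology.

```python
from collections import namedtuple

# Hypothetical container for the three metrics of Eq. (1).
Scores = namedtuple("Scores", ["precision", "recall", "f_score"])


def ner_scores(gold, predicted):
    """Compute P, R and F from two collections of (start, end, label) tuples.

    Assumption: an entity is a true positive only if its span and its
    label both match a gold annotation exactly.
    """
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)   # predicted and present in the gold set
    fp = len(predicted - gold)   # predicted but absent from the gold set
    fn = len(gold - predicted)   # in the gold set but not predicted

    # Guard against empty denominators by returning 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f_score = 2 * precision * recall / (precision + recall)
    else:
        f_score = 0.0
    return Scores(precision, recall, f_score)


if __name__ == "__main__":
    # Illustrative annotations only; the labels are placeholders.
    gold = {(0, 5, "PERSON"), (10, 18, "PLACE"), (25, 30, "ORG")}
    predicted = {(0, 5, "PERSON"), (10, 18, "ORG"), (25, 30, "ORG"), (40, 44, "PLACE")}
    print(ner_scores(gold, predicted))
```

In this example TP = 2, FP = 2 and FN = 1, so P = 1/2, R = 2/3 and F = 4/7. Returning 0.0 when a denominator vanishes is a convention chosen for the sketch, not something stated in the source.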
References
Bekiari, C., Bruseker, G., Doerr, M., Ore, C.E., Stead, S., Velios, A.: CIDOC CRM. International Committee for Documentation (CIDOC) of the International Council of Museums (ICOM). Version 7.1.1. https://doi.org/10.26225/FDZH-X261
Bombini, A., et al.: CHNet cloud: an EOSC-based cloud for physical technologies applied to cultural heritages. In: GARR (ed.) Conferenza GARR 2021 - Sostenibile/Digitale. Dati e tecnologie per il futuro, vol. selected papers. Associazione Consortium GARR (2021). https://doi.org/10.26314/GARR-Conf21-proceedings-09
Bosco, C., Lenci, A., Montemagni, S., Simi, M.: Universal dependencies 2.9 - Italian corpus, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University (2021). http://hdl.handle.net/11234/1-4611
Bosco, C., Montemagni, S., Simi, M.: Converting Italian treebanks: Towards an Italian Stanford dependency treebank. In: Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, Sofia, Bulgaria, pp. 61–69. Association for Computational Linguistics, Aug 2013. https://aclanthology.org/W13-2308
Castelli, L., Felicetti, A., Proietti, F.: Heritage science and cultural heritage: standards and tools for establishing cross-domain data interoperability (2019). https://doi.org/10.1007/s00799-019-00275-2
Ceccanti, A., Vianello, E., Caberletti, M., Giacomini, F.: Beyond X.509: token-based authentication and authorization for HEP. EPJ Web Conf. 214, 09002 (2019). https://doi.org/10.1051/epjconf/201921409002
Chiari, M., et al.: LABEC, the INFN ion beam laboratory of nuclear techniques for environment and cultural heritage. Eur. Phys. J. Plus 136(4), 472 (2021). https://doi.org/10.1140/epjp/s13360-021-01411-1
van Dalen-Oskam, K., et al.: Named entity recognition and resolution for literary studies. In: CLIN 2014 (2014)
DataCloud-Collaboration: INDIGO-DataCloud: A data and computing platform to facilitate seamless access to e-infrastructures
Honnibal, M., Johnson, M.: An improved non-monotonic transition system for dependency parsing. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, pp. 1373–1378. Association for Computational Linguistics, Sept 2015. https://aclweb.org/anthology/D/D15/D15-1162
Honnibal, M., et al.: Explosion/spaCy: v2.1.7: Improved evaluation, better language factories and bug fixes, Aug 2019. https://doi.org/10.5281/zenodo.3358113
Honnibal, M., Montani, I., Van Landeghem, S., Boyd, A.: spaCy: industrial-strength Natural Language Processing in Python (2020). https://doi.org/10.5281/zenodo.1212303
van Hooland, S., De Wilde, M., Verborgh, R., Steiner, T., Van de Walle, R.: Exploring entity recognition and disambiguation for cultural heritage collections. Digital Sch. Humanit. 30(2), 262–279 (2013). https://doi.org/10.1093/llc/fqt067
(IETF): The OAuth 2.0 Authorization Framework (2012). https://datatracker.ietf.org/doc/html/rfc6749
Jain, N., Krestel, R.: Who is Mona L.? Identifying mentions of artworks in historical archives. In: Doucet, A., Isaac, A., Golub, K., Aalberg, T., Jatowt, A. (eds.) TPDL 2019. LNCS, vol. 11799, pp. 115–122. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-30760-8_10
Klie, J.C., Bugert, M., Boullosa, B., de Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9. Association for Computational Linguistics, June 2018. http://tubiblio.ulb.tu-darmstadt.de/106270/
Mesnil, G., et al.: Unsupervised and transfer learning challenge: a deep learning approach. In: Guyon, I., Dror, G., Lemaire, V., Taylor, G., Silver, D. (eds.) Proceedings of ICML Workshop on Unsupervised and Transfer Learning. Proceedings of Machine Learning Research, Bellevue, Washington, USA, vol. 27, pp. 97–110. PMLR, 02 July 2012. https://proceedings.mlr.press/v27/mesnil12a.html
Montani, I., et al.: explosion/spaCy: v3.1.4: Python 3.10 wheels and support for AppleOps, Oct 2021. https://doi.org/10.5281/zenodo.5617894
Mosallam, Y., Abi-Haidar, A., Ganascia, J.-G.: Unsupervised named entity recognition and disambiguation: an application to old French journals. In: Perner, P. (ed.) ICDM 2014. LNCS (LNAI), vol. 8557, pp. 12–23. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-08976-8_2
Niccolucci, F., Felicetti, A.: A CIDOC CRM-based model for the documentation of heritage sciences. In: 2018 3rd Digital Heritage International Congress (DigitalHERITAGE) held jointly with 2018 24th International Conference on Virtual Systems Multimedia (VSMM 2018), pp. 1–6 (2018). https://doi.org/10.1109/DigitalHeritage.2018.8810109
Wilkinson, M.D., Dumontier, M., Aalbersberg, I.J., et al.: The FAIR guiding principles for scientific data management and stewardship. Sci. Data 3, 160018 (2016). https://doi.org/10.1038/sdata.2016.1
Acknowledgments
The present research has been partially funded by the European Commission within the Framework Programme Horizon 2020 with the projects ARIADNEplus (GA no. H2020-INFRAIA-01-2018-2019-823914) and EOSC-Pillar (GA no. H2020-INFRAEOSC-05-2018-2019-857650).
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Bombini, A., Castelli, L., Felicetti, A., Niccolucci, F., Reccia, A., Taccetti, F. (2022). Towards the Creation of AI-powered Queries Using Transfer Learning on NLP Model - The THESPIAN-NER Experience. In: Mazzeo, P.L., Frontoni, E., Sclaroff, S., Distante, C. (eds) Image Analysis and Processing. ICIAP 2022 Workshops. ICIAP 2022. Lecture Notes in Computer Science, vol 13374. Springer, Cham. https://doi.org/10.1007/978-3-031-13324-4_23
DOI: https://doi.org/10.1007/978-3-031-13324-4_23
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-13323-7
Online ISBN: 978-3-031-13324-4
eBook Packages: Computer Science (R0)