A Machine Learning Based Methodology for Automatic Annotation and Anonymisation of Privacy-Related Items in Textual Documents for Justice Domain

Di Martino, Beniamino; Marulli, Fiammetta; Lupi, Pietro; Cataldi, Alessandra

doi:10.1007/978-3-030-50454-0_55

Conference paper
First Online: 11 June 2020

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 1194)

Abstract

Textual resources are currently annotated either manually, by human experts who select hand-crafted features, or automatically, by trained systems. Expert annotations are very accurate but demand heavy effort to compile and are most often not publicly accessible. Automatic approaches save effort but do not yet reach the required accuracy, mostly because of the great difficulty and labor involved in representing domain experts’ knowledge in a machine-readable format. This work tackles the problem of automatically annotating plain-text resources; it was motivated by the need to support Italian justice officers in detecting sensitive information contained in large amounts of judgement documents, for privacy-preservation purposes. We propose a novel methodology, based on unsupervised machine learning techniques, to help human experts detect sensitive information. We performed experiments on about 20,000 plain-text documents and obtained an accuracy of about 75% in the preliminary validation stage.

Acknowledgements

The study described in this work was performed and co-funded as a part of the research activities of the Applied Research Project “Big data Giustizia e Datawarehouse” promoted by the Italian Ministry of Justice and realized by Consorzio Interuniversitario Nazionale per l’Informatica (CINI).

Author information

Authors and Affiliations

Department of Engineering, Università della Campania “L.Vanvitelli”, Aversa, Italy
Beniamino Di Martino
Department of Maths and Physics, Università della Campania “L.Vanvitelli”, Caserta, Italy
Fiammetta Marulli
CINI - Consorzio Interuniversitario Nazionale per l’Informatica, Rome, Italy
Beniamino Di Martino & Fiammetta Marulli
Direzione Generale Sistemi Informativi Automatizzati, Ministero della Giustizia, Rome, Italy
Pietro Lupi & Alessandra Cataldi

Corresponding authors

Correspondence to Beniamino Di Martino or Fiammetta Marulli.

Editor information

Editors and Affiliations

Department of Information and Communication Engineering, Faculty of Information Engineering, Fukuoka Institute of Technology, Fukuoka, Japan
Leonard Barolli
Institute of Information Technology, Lodz University of Technology, Łódź, Poland
Aneta Poniszewska-Maranda
Faculty of Business Administration, Rissho University, Tokyo, Japan
Tomoya Enokido

Cite this paper

Di Martino, B., Marulli, F., Lupi, P., Cataldi, A. (2021). A Machine Learning Based Methodology for Automatic Annotation and Anonymisation of Privacy-Related Items in Textual Documents for Justice Domain. In: Barolli, L., Poniszewska-Maranda, A., Enokido, T. (eds) Complex, Intelligent and Software Intensive Systems. CISIS 2020. Advances in Intelligent Systems and Computing, vol 1194. Springer, Cham. https://doi.org/10.1007/978-3-030-50454-0_55

DOI: https://doi.org/10.1007/978-3-030-50454-0_55
Published: 11 June 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-50453-3
Online ISBN: 978-3-030-50454-0
eBook Packages: Intelligent Technologies and Robotics, Intelligent Technologies and Robotics (R0)
