Abstract
Corpora with reliable annotation are a valuable resource for Natural Language Processing, which justifies the search for methods capable of assisting linguistic revision. In this context, we present a study on methods for revising dependency treebanks, investigating the contribution of three different strategies to corpus review: (i) linguistic rules; (ii) an adaptation of the n-gram method proposed by Boyd et al. (2008) applied to Portuguese; and (iii) Inter-Annotator Disagreement, a linguistically motivated approach that draws inspiration from the human annotation process. The results are promising: taken together, the three methods can lead to the revision of up to 58% of the errors in a specific corpus at the cost of revising only 20% of it. We also present a tool that integrates treebank editing, evaluation and search capabilities with the review methods, as well as a gold-standard Portuguese corpus from the oil and gas domain.
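The core intuition behind the variation-n-gram approach of Boyd et al. (2008) can be illustrated with a minimal sketch: identical word sequences that recur in the corpus with different labels on the same token are candidates for inconsistent annotation. The simplified representation below (sentences as lists of form/deprel pairs, with the middle token as the nucleus) is our own illustration, not the authors' implementation.

```python
from collections import defaultdict

def variation_nuclei(sentences, n=3):
    """Find word n-grams that recur with different dependency labels on
    their middle token -- candidate annotation inconsistencies.
    A simplification of the variation-n-gram idea of Boyd et al. (2008)."""
    seen = defaultdict(set)  # n-gram of word forms -> labels of its nucleus
    for sent in sentences:   # sent: list of (form, deprel) pairs
        for i in range(len(sent) - n + 1):
            window = sent[i:i + n]
            forms = tuple(form for form, _ in window)
            seen[forms].add(window[n // 2][1])  # label of the middle token
    # keep only n-grams whose nucleus was labelled in more than one way
    return {forms: labels for forms, labels in seen.items() if len(labels) > 1}

corpus = [
    [("oil", "compound"), ("well", "obj"), ("pressure", "nmod")],
    [("oil", "compound"), ("well", "nmod"), ("pressure", "nmod")],
]
print(variation_nuclei(corpus))  # flags ("oil", "well", "pressure")
```

A flagged n-gram is not necessarily an error: as discussed in the notes below, identical strings may legitimately receive different analyses in different contexts, so the method produces candidates for human review rather than corrections.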
Notes
Moreover, as noted by Baker (1997), even fully human annotation is susceptible to error and inconsistency.
As noted by a reviewer, consistency and correctness are distinct but related phenomena. In the context of linguistic annotation and IAA, high consistency (measured by high inter-annotator agreement rates) is taken as an indicator—but not a guarantee—of high-quality annotation, as consistently poorly annotated text would also lead to high rates of IAA. On the other hand, two (or more) identical segments can be annotated in an “inconsistent” way, that is, differently, without this difference being an error, if they perform different functions in the contexts in which they appear (Fig. 2).
Numbers from Petro1, following the trend found in de Marneffe et al. (2017) for English and French.
These rules can be found at https://github.com/alvelvis/ACDC-UD/blob/master/validar_UD.txt and are written in Python syntax.
The dependency relation “det” should be used to tag relations between determiners and their heads.
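A validation rule of this kind can be sketched as a simple Python check, in the spirit of the Python-syntax rules in the repository above (this is a hypothetical illustration, not one of the actual rules): tokens tagged DET whose deprel is not "det" are flagged as candidates for review, since determiners can occasionally bear other relations legitimately.

```python
def check_det(sentence):
    """Flag tokens tagged DET whose deprel is not 'det'.
    sentence: list of dicts with 'form', 'upos' and 'deprel' keys,
    a simplified CoNLL-U token representation."""
    return [tok["form"] for tok in sentence
            if tok["upos"] == "DET" and tok["deprel"] != "det"]

sent = [
    {"form": "o",    "upos": "DET",  "deprel": "nsubj"},  # suspicious
    {"form": "poço", "upos": "NOUN", "deprel": "root"},
]
print(check_det(sent))  # ["o"]
```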
As mentioned by a reviewer, although the correlation between consistency and correctness underlies the use of inter-annotator agreement as a measure of annotation quality, this assumption should be viewed with caution, as we report in Sect. 6.
According to Zeman et al. (2018), the LAS performance of UDPipe v1.2 for Portuguese is 82.07%, while Stanza achieved an 87.81% score on the same dataset.
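LAS (Labeled Attachment Score), the metric behind these figures, is the percentage of tokens whose predicted head and dependency relation both match the gold annotation. A minimal sketch of the computation, using our own parallel-list representation:

```python
def las(gold, pred):
    """Labeled Attachment Score: percentage of tokens whose predicted
    head index AND dependency relation both match the gold annotation.
    gold, pred: parallel lists of (head, deprel) pairs, one per token."""
    correct = sum(1 for g, p in zip(gold, pred) if g == p)
    return 100 * correct / len(gold)

gold = [(2, "det"), (0, "root"), (2, "nmod")]
pred = [(2, "det"), (0, "root"), (2, "obl")]  # one token mislabelled
print(f"{las(gold, pred):.2f}")  # 66.67
```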
The need for two parsers is justified by the automatic correction method proposed by the authors, in which three different parsers are used.
Julgamento is open source and can be downloaded from https://github.com/alvelvis/Julgamento.
Available at https://github.com/alvelvis/conllu-merge-resolver.
This error can be seen in two ways: deprel is wrong because dephead is wrong, or dephead is wrong because deprel is wrong. As the revision begins with the simplified CM, we started with deprel.
Since we were concerned with creating the best parsing model for Petro2, these changes were designed to decrease the number of unseen words and structures, possibly decreasing the amount of revision required, even though the addition to the training data was small (Petro1 represents less than 7% of the training data).
Even for the annotation of Petro2, to whose training data we added Petro1, the vast majority of the data (93%) still came from Bosque-UD.
References
Afonso, S., Bick, E., Haber, R., & Santos, D. (2002). Floresta Sintá(c)tica: A treebank for Portuguese. Proceedings of the third international conference on language resources and evaluation (LREC 2002). ELRA.
Artstein, R. (2017). Inter-annotator agreement. In N. Ide & J. Pustejovsky (Eds.), Handbook of linguistic annotation (pp. 297–313). Springer.
Baker, J. P. (1997). Consistency and accuracy in correcting automatically tagged data. In R. Garside, G. Leech, & A. McEnery (Eds.), Corpus annotation: Linguistic information from computer text corpora (pp. 243–250). Longman.
Blaheta, D. (2002). Handling noisy training and testing data. Proceedings of the 2002 conference on empirical methods in natural language processing (EMNLP 2002) (pp. 111–116). EMNLP.
Boyd, A., Dickinson, M., & Meurers, W. D. (2008). On detecting errors in dependency treebanks. Research on Language and Computation, 6(2), 113–137.
Consoli, B., Santos, J., Gomes, D., Cordeiro, F., Vieira, R., & Moreira, V. (2020). Embeddings for named entity recognition in geoscience Portuguese literature. Proceedings of the 12th language resources and evaluation conference (pp. 4625–4630). European Language Resources Association.
de Eckart Castilho, R., Mújdricza-Maydt, E., Yimam, S. M., Hartmann, S., Gurevych, I., Frank, A., & Biemann, C. (2016). A web-based tool for the integrated annotation of semantic and syntactic structures. Proceedings of the workshop on language technology resources and tools for digital humanities (LT4DH) (pp. 74–86). The COLING 2016 Organizing Committee.
de Marneffe, M. C., Grioni, M., Kanerva, J., & Ginter, F. (2017). Assessing the annotation consistency of the Universal Dependencies corpora. Proceedings of the fourth international conference on dependency linguistics (Depling 2017) (pp. 108–115). Linköping University Electronic Press. https://www.aclweb.org/anthology/W17-6514
de Souza, E., & Freitas, C. (2022). Polishing the gold—how much revision do we need in treebanks? In T. Pardo, A. Felippo, N. Roman (Eds.), Universal dependencies Brazilian festival—proceedings of the conference. SBC.
Dickinson, M. (2015). Detection of annotation errors in corpora. Language and Linguistics Compass, 9(3), 119–138.
Dickinson, M., & Meurers, D. (2003a). Detecting errors in part-of-speech annotation. 10th conference of the European chapter of the association for computational linguistics. The Ohio State University.
Dickinson, M., & Meurers, W. D. (2003b). Detecting inconsistencies in treebanks. Proceedings of TLT (pp. 45–56). The Ohio State University.
Freitas, C., Rocha, P., & Bick, E. (2008). Floresta Sintá(c)tica: Bigger, thicker and easier. In A. Teixeira, V. L. S. de Lima, L. C. de Oliveira, & P. Quaresma (Eds.), Computational processing of the Portuguese language, 8th international conference, proceedings (PROPOR 2008) (pp. 216–219). Springer.
Gerdes, K. (2013). Collaborative dependency annotation. Proceedings of the second international conference on dependency linguistics (DepLing 2013) (pp. 88–97).
Manning, C.D. (2011). Part-of-speech tagging from 97% to 100%: is it time for some linguistics? International conference on intelligent text processing and computational linguistics (pp. 171–189). Springer.
Nivre, J., de Marneffe, M. C., Ginter, F., Goldberg, Y., Hajic, J., Manning, C. D., McDonald, R., Petrov, S., Pyysalo, S., Silveira, N., Tsarfaty, R., & Zeman, D. (2016). Universal Dependencies v1: A multilingual treebank collection. Proceedings of the tenth international conference on language resources and evaluation (LREC’16) (pp. 1659–1666).
Nivre, J., & Fang, C.T. (2017). Universal dependency evaluation. Proceedings of the NoDaLiDa 2017 workshop on universal dependencies (UDW 2017) (pp. 86–95).
Oliva, K. (2001). The possibilities of automatic detection/correction of errors in tagged corpora: A pilot study on a German corpus. International conference on text, speech and dialogue (pp. 39–46). Springer.
Qi, P., Zhang, Y., Zhang, Y., Bolton, J., & Manning, C.D. (2020). Stanza: A python natural language processing toolkit for many human languages. Proceedings of the 58th annual meeting of the association for computational linguistics: system demonstrations (pp. 101–108). Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-demos.14.
Rademaker, A., Chalub, F., Real, L., Freitas, C., Bick, E., & de Paiva, V. (2017). Universal Dependencies for Portuguese. Proceedings of the fourth international conference on dependency linguistics (Depling 2017) (pp. 197–206).
Schneider, G., & Volk, M. (1998). Comparing a statistical and a rule-based tagger for German. KONVENS-98.
Straka, M., Hajic, J., & Straková, J. (2016). UDPipe: Trainable pipeline for processing CoNLL-U files performing tokenization, morphological analysis, POS tagging and parsing. Proceedings of the tenth international conference on language resources and evaluation (LREC’16) (pp. 4290–4297). ELRA.
van Halteren, H. (2000). The detection of inconsistency in manually tagged text. Proceedings of the COLING-2000 workshop on linguistically interpreted corpora (pp. 48–55). International Committee on Computational Linguistics, Centre Universitaire. https://aclanthology.org/W00-1907
Volokh, A., & Neumann, G. (2011). Automatic detection and correction of errors in dependency treebanks. Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies (pp. 346–350). Association for Computational Linguistics. https://aclanthology.org/P11-2060
Wallis, S. (2003). Completing parsed corpora. Treebanks (pp. 61–71). Springer.
Zeman, D., Hajic, J., Popel, M., Potthast, M., Straka, M., Ginter, F., Nivre, J., & Petrov, S. (2018). CoNLL 2018 shared task: Multilingual parsing from raw text to Universal Dependencies. Proceedings of the CoNLL 2018 shared task: multilingual parsing from raw text to universal dependencies (pp. 1–21).
Acknowledgements
This study was partially funded by the National Agency for Petroleum, Natural Gas and Biofuels (ANP), Brazil, associated with the investment of resources from the R, D & I clauses, through a Cooperation Agreement between Petrobras and PUC-Rio. We would like to thank the team at the Applied Computational Intelligence Laboratory (ICA) at PUC-Rio for the generation of morphosyntactic annotation models trained in Stanza, and Elvis de Souza thanks the National Council for Scientific and Technological Development (CNPq) for the Master’s scholarship (process no. 130495/2021-2).
Cite this article
Freitas, C., de Souza, E. A study on methods for revising dependency treebanks: in search of gold. Lang Resources & Evaluation 58, 111–131 (2024). https://doi.org/10.1007/s10579-023-09653-4