Abstract
In this paper, we report on our experiments towards multilingual discourse connective (DC) identification and show that language-specific BERT models appear to be sufficient even with little task-specific training data. While some languages have large corpora with human-annotated DCs, most languages lack such resources. Hence, relying solely on discourse-annotated corpora to train a DC identification system for low-resource languages is insufficient. To address this issue, we developed a model based on pretrained BERT and fine-tuned it with discourse-annotated data of varying sizes. To measure the effect of larger training data, we induced synthetic training corpora with DC annotations using word-aligned parallel corpora. We evaluated our models on three languages (English, Turkish and Mandarin Chinese) in the context of the recent DISRPT 2021 Task 2 shared task. Results show that the F-measure achieved by the standard BERT model (92.49%, 93.97% and 87.42% for English, Turkish and Chinese, respectively) is hard to improve upon even with larger task-specific training corpora.
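The idea of inducing DC annotations from word-aligned parallel corpora can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `project_dc_labels`, the binary label scheme, the English-French sentence pair and the hand-written alignment are all assumptions made for illustration. In practice the alignment would come from an automatic word aligner.

```python
from typing import List, Tuple


def project_dc_labels(
    src_labels: List[int],
    alignment: List[Tuple[int, int]],
    tgt_len: int,
) -> List[int]:
    """Copy binary DC labels from source tokens to their aligned target tokens.

    src_labels: 1 if the source token is part of a discourse connective, else 0.
    alignment:  (source_index, target_index) pairs from a word aligner.
    tgt_len:    number of tokens in the target sentence.
    """
    tgt_labels = [0] * tgt_len
    for src_idx, tgt_idx in alignment:
        if src_labels[src_idx] == 1:
            tgt_labels[tgt_idx] = 1
    return tgt_labels


# Hypothetical toy sentence pair; the alignment is written by hand.
src_tokens = ["I", "stayed", "home", "because", "it", "rained"]
src_labels = [0, 0, 0, 1, 0, 0]  # "because" is the connective
tgt_tokens = ["Je", "suis", "resté", "chez", "moi", "parce", "qu'", "il", "pleuvait"]
alignment = [
    (0, 0), (1, 1), (1, 2), (2, 3), (2, 4),
    (3, 5), (3, 6),  # "because" aligns to "parce" and "qu'"
    (4, 7), (5, 8),
]

tgt_labels = project_dc_labels(src_labels, alignment, len(tgt_tokens))
for token, label in zip(tgt_tokens, tgt_labels):
    print(f"{token}\t{label}")
```

Target tokens with no aligned source connective keep the label 0, so alignment errors translate directly into noisy synthetic labels. That is one plausible reason why a larger induced corpus may not improve on a model fine-tuned on a small amount of gold data.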
Notes
Using the official DISRPT 2021 scorer available at https://github.com/disrpt/sharedtask2021.
Acknowledgment
The authors would like to thank the anonymous reviewers for their valuable comments on an earlier version of this paper. This work was financially supported by the Natural Sciences and Engineering Research Council of Canada (NSERC).
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Chapados Muermans, T., Kosseim, L. (2022). A BERT-Based Approach for Multilingual Discourse Connective Detection. In: Rosso, P., Basile, V., Martínez, R., Métais, E., Meziane, F. (eds) Natural Language Processing and Information Systems. NLDB 2022. Lecture Notes in Computer Science, vol 13286. Springer, Cham. https://doi.org/10.1007/978-3-031-08473-7_41
DOI: https://doi.org/10.1007/978-3-031-08473-7_41
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-08472-0
Online ISBN: 978-3-031-08473-7
eBook Packages: Computer Science, Computer Science (R0)