A BERT-Based Approach for Multilingual Discourse Connective Detection

Chapados Muermans, Thomas; Kosseim, Leila

doi:10.1007/978-3-031-08473-7_41

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13286)

Included in the following conference series:

International Conference on Applications of Natural Language to Information Systems

Abstract

In this paper, we report on our experiments towards multilingual discourse connective (DC) identification and show that language-specific BERT models appear to be sufficient even with little task-specific training data. While some languages have large corpora with human-annotated DCs, most languages lack such resources. Hence, relying solely on discourse-annotated corpora to train a DC identification system for low-resource languages is insufficient. To address this issue, we developed a model based on pretrained BERT and fine-tuned it with discourse-annotated data of varying sizes. To measure the effect of larger training data, we induced synthetic training corpora with DC annotations using word-aligned parallel corpora. We evaluated our models on three languages (English, Turkish and Mandarin Chinese) in the context of the recent DISRPT 2021 Task 2 shared task. Results show that the F-measure achieved by the standard BERT model (92.49%, 93.97% and 87.42% for English, Turkish and Chinese, respectively) is hard to improve upon, even with larger task-specific training corpora.
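
The idea of inducing DC annotations from word-aligned parallel corpora can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `project_dc_labels`, the BIO tag names `B-DC`/`I-DC`, the alignment format (pairs of source and target token indices) and the English-French toy sentence are all assumptions made for illustration. The sketch copies connective labels from an annotated source sentence to every target token aligned with a connective token.

```python
from typing import List, Tuple


def project_dc_labels(
    source_labels: List[str],
    alignment: List[Tuple[int, int]],
    target_length: int,
) -> List[str]:
    """Project BIO connective labels from source tokens onto target tokens.

    source_labels: one tag per source token ("O", "B-DC" or "I-DC"); hypothetical tag set.
    alignment: (source_index, target_index) pairs from any word aligner.
    target_length: number of tokens in the target sentence.
    """
    target_labels = ["O"] * target_length

    # Mark every target token aligned to a source token inside a connective.
    for src_idx, tgt_idx in alignment:
        if source_labels[src_idx] != "O" and 0 <= tgt_idx < target_length:
            target_labels[tgt_idx] = "I-DC"

    # Re-derive BIO boundaries: the first token of each contiguous run becomes B-DC.
    previous_marked = False
    for i, label in enumerate(target_labels):
        marked = label != "O"
        if marked and not previous_marked:
            target_labels[i] = "B-DC"
        previous_marked = marked

    return target_labels


# Toy example (illustrative only, not taken from the paper's data).
source_tokens = ["He", "left", "because", "it", "rained"]
source_labels = ["O", "O", "B-DC", "O", "O"]
target_tokens = ["Il", "est", "parti", "parce", "qu'", "il", "pleuvait"]
alignment = [(0, 0), (1, 1), (1, 2), (2, 3), (2, 4), (3, 5), (4, 6)]

projected = project_dc_labels(source_labels, alignment, len(target_tokens))
for token, label in zip(target_tokens, projected):
    print(f"{token}\t{label}")
```

In this toy case the single English connective token is aligned to two target tokens, so the projected span becomes a two-token connective. A real pipeline would also need to handle unaligned connectives and noisy alignments, which this sketch ignores.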

Notes

1. https://sites.google.com/georgetown.edu/disrpt2021/home
2. https://sites.google.com/georgetown.edu/disrpt2021/home
3. Using the official DISRPT 2021 scorer available at https://github.com/disrpt/sharedtask2021.

Acknowledgment

The authors would like to thank the anonymous reviewers for their valuable comments on an earlier version of this paper. This work was financially supported by the Natural Sciences and Engineering Research Council of Canada (NSERC).

Author information

Authors and Affiliations

Computational Linguistics at Concordia (CLaC) Laboratory, Department of Computer Science and Software Engineering, Concordia University, Montréal, QC, Canada
Thomas Chapados Muermans & Leila Kosseim

Corresponding author

Correspondence to Leila Kosseim.

Editor information

Editors and Affiliations

Universitat Politècnica de València, Valencia, Spain
Paolo Rosso
University of Turin, Torino, Italy
Valerio Basile
Universidad Nacional de Educación a Distancia, Madrid, Spain
Raquel Martínez
Conservatoire National des Arts et Métiers, Paris, France
Elisabeth Métais
University of Derby, Derby, UK
Farid Meziane

About this paper

Cite this paper

Chapados Muermans, T., Kosseim, L. (2022). A BERT-Based Approach for Multilingual Discourse Connective Detection. In: Rosso, P., Basile, V., Martínez, R., Métais, E., Meziane, F. (eds) Natural Language Processing and Information Systems. NLDB 2022. Lecture Notes in Computer Science, vol 13286. Springer, Cham. https://doi.org/10.1007/978-3-031-08473-7_41

DOI: https://doi.org/10.1007/978-3-031-08473-7_41
Published: 13 June 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-08472-0
Online ISBN: 978-3-031-08473-7
eBook Packages: Computer Science, Computer Science (R0)
