A BERT-Based Approach for Multilingual Discourse Connective Detection

  • Conference paper
Natural Language Processing and Information Systems (NLDB 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13286)

Abstract

In this paper, we report on our experiments on multilingual discourse connective (DC) identification and show that language-specific BERT models appear to be sufficient even with little task-specific training data. While some languages have large corpora with human-annotated DCs, most languages lack such resources. Hence, relying solely on discourse-annotated corpora to train a DC identification system for low-resource languages is insufficient. To address this issue, we developed a model based on pretrained BERT and fine-tuned it with discourse-annotated data of varying sizes. To measure the effect of larger training data, we induced synthetic training corpora with DC annotations using word-aligned parallel corpora. We evaluated our models on three languages (English, Turkish and Mandarin Chinese) in the context of the recent DISRPT 2021 Task 2 shared task. Results show that the F-measure achieved by the standard BERT model (92.49%, 93.97% and 87.42% for English, Turkish and Chinese, respectively) is hard to improve upon even with larger task-specific training corpora.
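The abstract mentions inducing synthetic DC annotations from word-aligned parallel corpora but does not spell out the procedure. As a minimal sketch of the general idea of annotation projection, the Python snippet below copies a DC label from each annotated source token to every target token aligned with it. The function name `project_dc_labels`, the simplified `"DC"`/`"O"` label set, and the English-French toy sentence pair with its hand-written alignment are all assumptions made for illustration; they are not taken from the paper, whose experiments cover English, Turkish and Mandarin Chinese.

```python
from typing import List, Tuple


def project_dc_labels(
    src_labels: List[str],
    alignment: List[Tuple[int, int]],
    tgt_len: int,
) -> List[str]:
    """Project token-level DC labels from a source sentence to a target sentence.

    src_labels: one label per source token, "DC" or "O" (simplified label set).
    alignment:  (source_index, target_index) pairs from a word aligner.
    tgt_len:    number of tokens in the target sentence.
    """
    # Every target token starts as a non-connective.
    tgt_labels = ["O"] * tgt_len
    for src_idx, tgt_idx in alignment:
        # A target token becomes a DC if any aligned source token is a DC.
        if src_labels[src_idx] == "DC":
            tgt_labels[tgt_idx] = "DC"
    return tgt_labels


# Hypothetical toy example: "because" aligns to the two tokens "parce qu'".
src_tokens = ["I", "left", "because", "it", "rained"]
src_labels = ["O", "O", "DC", "O", "O"]
tgt_tokens = ["Je", "suis", "parti", "parce", "qu'", "il", "pleuvait"]
alignment = [(0, 0), (1, 1), (1, 2), (2, 3), (2, 4), (3, 5), (4, 6)]

projected = project_dc_labels(src_labels, alignment, len(tgt_tokens))
for token, label in zip(tgt_tokens, projected):
    print(f"{token}\t{label}")
```

In this sketch, unaligned target tokens simply stay `"O"`, and alignment errors propagate directly into the projected labels, which is one reason projected corpora tend to be noisier than human-annotated ones.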

Notes

  1. https://sites.google.com/georgetown.edu/disrpt2021/home.

  2. https://sites.google.com/georgetown.edu/disrpt2021/home.

  3. Using the official DISRPT 2021 scorer available at https://github.com/disrpt/sharedtask2021.

Acknowledgment

The authors would like to thank the anonymous reviewers for their valuable comments on an earlier version of this paper. This work was financially supported by the Natural Sciences and Engineering Research Council of Canada (NSERC).

Author information

Corresponding author

Correspondence to Leila Kosseim.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Chapados Muermans, T., Kosseim, L. (2022). A BERT-Based Approach for Multilingual Discourse Connective Detection. In: Rosso, P., Basile, V., Martínez, R., Métais, E., Meziane, F. (eds) Natural Language Processing and Information Systems. NLDB 2022. Lecture Notes in Computer Science, vol 13286. Springer, Cham. https://doi.org/10.1007/978-3-031-08473-7_41

  • DOI: https://doi.org/10.1007/978-3-031-08473-7_41

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-08472-0

  • Online ISBN: 978-3-031-08473-7

  • eBook Packages: Computer Science, Computer Science (R0)
