Multi-Wiki90k: Multilingual Benchmark Dataset for Paragraph Segmentation

Swędrowski, Michał; Miłkowski, Piotr; Bojanowski, Bartłomiej; Kocoń, Jan

doi:10.1007/978-3-031-16210-7_11

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1653)

Included in the following conference series:

International Conference on Computational Collective Intelligence

Abstract

In this paper, we present an approach to paragraph segmentation based on cross-lingual knowledge transfer models. We investigate the quality of multilingual models, such as mBERT and XLM-RoBERTa, as well as the language-independent models LASER and LaBSE. We study segmentation quality in 9 European languages, both for each language separately and for all languages simultaneously. Our solutions achieve high quality while remaining language-independent. To support this work, we introduce a new multilingual benchmark dataset called Multi-Wiki90k.

Acknowledgements

This work was financed by (1) the National Science Centre, Poland, project no. 2019/33/B/HS2/02814; (2) the Polish Ministry of Education and Science, CLARIN-PL; (3) the European Regional Development Fund as a part of the 2014–2020 Smart Growth Operational Programme, CLARIN – Common Language Resources and Technology Infrastructure, project no. POIR.04.02.00-00C002/19; (4) the statutory funds of the Department of Artificial Intelligence, Wrocław University of Science and Technology.

Author information

Authors and Affiliations

Wrocław University of Science and Technology, 50-370, Wrocław, Poland
Michał Swędrowski, Piotr Miłkowski, Bartłomiej Bojanowski & Jan Kocoń

Corresponding author

Correspondence to Michał Swędrowski.

Editor information

Editors and Affiliations

University of Craiova, Craiova, Romania
Costin Bădică
Vrije Universiteit Amsterdam, Amsterdam, The Netherlands
Jan Treur
Claude Bernard University Lyon 1, Villeurbanne Cedex, France
Djamal Benslimane
Wrocław University of Science and Technology, Wrocław, Poland
Bogumiła Hnatkowska
Wrocław University of Science and Technology, Wrocław, Poland
Marek Krótkiewicz

About this paper

Cite this paper

Swędrowski, M., Miłkowski, P., Bojanowski, B., Kocoń, J. (2022). Multi-Wiki90k: Multilingual Benchmark Dataset for Paragraph Segmentation. In: Bădică, C., Treur, J., Benslimane, D., Hnatkowska, B., Krótkiewicz, M. (eds) Advances in Computational Collective Intelligence. ICCCI 2022. Communications in Computer and Information Science, vol 1653. Springer, Cham. https://doi.org/10.1007/978-3-031-16210-7_11

DOI: https://doi.org/10.1007/978-3-031-16210-7_11
Published: 21 September 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-16209-1
Online ISBN: 978-3-031-16210-7
eBook Packages: Computer Science, Computer Science (R0)
