Multilingual Tokenization and Part-of-speech Tagging. Lightweight Versus Heavyweight Algorithms

Boros, Tiberiu; Dumitrescu, Stefan Daniel

doi:10.1007/978-3-319-93782-3_11

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10930)

Included in the following conference series:

Language and Technology Conference

Abstract

This work focuses on morphological analysis of raw text and provides a recipe for tokenization, sentence splitting and part-of-speech tagging for all languages included in the Universal Dependencies Corpus. Scalability is an important issue when dealing with large multilingual corpora. The experiments include both lightweight classifiers (linear models and decision trees) and heavyweight LSTM-based architectures, which are able to attain state-of-the-art results. All the experiments are carried out using the provided data “as-is”. We apply lightweight and heavyweight classifiers to 5 distinct tasks across multiple languages; we present some lessons learned during the training process; we look at per-language results as well as task averages; we present model footprints; and finally we draw a few conclusions regarding trade-offs between the classifiers’ characteristics.

Notes

1. http://slp.racai.ro/index.php/mlpla-new/.
2. In most of our experiments we set $\alpha = 10^{-4}$.
3. After a number of tests, we fixed $h=5$ for all languages.
4. In our experiments we observed that $k=10$ is a good choice for many of the languages we used for tuning.
5. http://slp.racai.ro/index.php/mlpla-new/.

Author information

Authors and Affiliations

Research Institute for Artificial Intelligence, Romanian Academy, Bucharest, Romania
Tiberiu Boros & Stefan Daniel Dumitrescu

Corresponding authors

Correspondence to Tiberiu Boros or Stefan Daniel Dumitrescu.

Editor information

Editors and Affiliations

Adam Mickiewicz University, Poznań, Poland
Zygmunt Vetulani
LIMSI-CNRS, Orsay Cedex, France
Joseph Mariani
Adam Mickiewicz University, Poznań, Poland
Marek Kubis

Cite this paper

Boros, T., Dumitrescu, S.D. (2018). Multilingual Tokenization and Part-of-speech Tagging. Lightweight Versus Heavyweight Algorithms. In: Vetulani, Z., Mariani, J., Kubis, M. (eds) Human Language Technology. Challenges for Computer Science and Linguistics. LTC 2015. Lecture Notes in Computer Science, vol 10930. Springer, Cham. https://doi.org/10.1007/978-3-319-93782-3_11

DOI: https://doi.org/10.1007/978-3-319-93782-3_11
Published: 16 June 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-93781-6
Online ISBN: 978-3-319-93782-3
eBook Packages: Computer Science (R0)
