Abstract
Tswana, a Bantu language in the Sotho group, is characterised by agglutinative morphology and a disjunctive orthography, which mainly affects the verb category. In particular, verbal prefixes are usually written disjunctively, while suffixes are written conjunctively. Tswana tokenisation therefore cannot be based solely on whitespace, as it can in many alphabetic, segmented languages, including the conjunctively written Nguni group of South African Bantu languages. This paper shows how two finite state tokeniser transducers and a finite state morphological analyser are combined to solve the Tswana (verb) tokenisation problem. The approach has the important advantage of bringing the processing of Tswana, beyond the morphological analysis level, in line with what is appropriate for the Nguni languages. The challenge of the disjunctive orthography is thus met at the tokenisation/morphological analysis level and does not, in principle, propagate to subsequent levels of analysis such as POS tagging and shallow parsing. The tokenisation approach is novel and, when implemented and evaluated, yields an F1-score of 95% with respect to a hand-tokenised gold standard.
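To make the reported figure concrete, the following is a minimal sketch of how a token-level F1-score against a hand-tokenised gold standard can be computed. The abstract does not describe the evaluation protocol, so the span-matching scheme here is an assumption, as are the function names `token_spans` and `token_f1` and the example phrase, which is illustrative and not taken from the article's data.

```python
def token_spans(tokens):
    """Map tokens to (start, end) character spans, ignoring whitespace.

    Offsets are counted over the text with all spaces removed, so a
    multi-part token such as "ke a reka" and its whitespace-split parts
    are measured on the same scale.
    """
    spans = set()
    pos = 0
    for tok in tokens:
        length = len(tok.replace(" ", ""))
        spans.add((pos, pos + length))
        pos += length
    return spans


def token_f1(predicted, gold):
    """F1 over exactly matching token spans (assumed scoring scheme)."""
    pred_spans = token_spans(predicted)
    gold_spans = token_spans(gold)
    correct = len(pred_spans & gold_spans)
    if correct == 0:
        return 0.0
    precision = correct / len(pred_spans)
    recall = correct / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


# Illustrative phrase (an assumption, not from the article): the verb is
# written as three orthographic words but treated as one gold token.
gold = ["ke a reka", "dibuka"]
whitespace_tokens = "ke a reka dibuka".split()

print(token_f1(whitespace_tokens, gold))
print(token_f1(gold, gold))
```

Under this scheme, splitting on whitespace alone is penalised in both precision and recall, because the fragments of a disjunctively written verb never match the single gold token that covers them.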


Cite this article
Pretorius, L., Viljoen, B., Berg, A. et al. Tswana finite state tokenisation. Lang Resources & Evaluation 49, 831–856 (2015). https://doi.org/10.1007/s10579-014-9292-1
DOI: https://doi.org/10.1007/s10579-014-9292-1