Data-Driven Part-of-Speech Tagging of Kiswahili

  • Conference paper
Text, Speech and Dialogue (TSD 2006)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4188)

Abstract

In this paper we present experiments with data-driven part-of-speech taggers trained and evaluated on the annotated Helsinki Corpus of Swahili. Using four of the current state-of-the-art data-driven taggers, TnT, MBT, SVMTool and MXPOST, we find the last of these, MXPOST, to be the most accurate tagger for the Kiswahili dataset. We further improve on the performance of the individual taggers by combining them into a committee of taggers. We observe that the more naive combination methods, like the novel plural voting approach, outperform more elaborate schemes like cascaded classifiers and weighted voting. This paper is the first publication to present experiments on data-driven part-of-speech tagging for Kiswahili and for Bantu languages in general.
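The idea of a committee of taggers can be illustrated with a minimal sketch. The abstract does not define plural voting, so the code below does not attempt to reproduce it; it only shows two generic combination schemes, plain majority voting and weighted voting, under stated assumptions. The function names, the tie-breaking rule, the toy tag sequences and the weights are all hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict


def majority_vote(predictions, tie_break=0):
    """Combine per-token tags from several taggers by simple majority.

    predictions: list of tag sequences, one per tagger, all of equal length.
    tie_break:   index of the tagger whose tag is preferred when it is among
                 the tied winners (an assumed rule, not the paper's).
    """
    combined = []
    for tags in zip(*predictions):
        counts = Counter(tags)
        best = max(counts.values())
        preferred = tags[tie_break]
        if counts[preferred] == best:
            combined.append(preferred)
        else:
            # Otherwise take the first tag (in tagger order) with the top count.
            combined.append(next(t for t in tags if counts[t] == best))
    return combined


def weighted_vote(predictions, weights):
    """Combine per-token tags, giving each tagger's vote a numeric weight.

    weights: one number per tagger, e.g. an accuracy measured on held-out
             data (hypothetical values are used in the example below).
    """
    combined = []
    for tags in zip(*predictions):
        scores = defaultdict(float)
        for tag, weight in zip(tags, weights):
            scores[tag] += weight
        combined.append(max(scores, key=scores.get))
    return combined


# Toy outputs of four hypothetical taggers for a three-token sentence.
tagger_outputs = [
    ["N", "V", "N"],
    ["N", "N", "N"],
    ["V", "V", "N"],
    ["N", "V", "ADJ"],
]

print(majority_vote(tagger_outputs))
print(weighted_vote(tagger_outputs, weights=[1, 1, 1, 4]))
```

With these toy inputs the two schemes disagree on the last token: the majority picks the tag that three taggers share, while the heavily weighted fourth tagger overrides them in the weighted scheme. This shows how the choice of combination method can change the committee's output.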

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

De Pauw, G., de Schryver, GM., Wagacha, P.W. (2006). Data-Driven Part-of-Speech Tagging of Kiswahili. In: Sojka, P., Kopeček, I., Pala, K. (eds) Text, Speech and Dialogue. TSD 2006. Lecture Notes in Computer Science, vol 4188. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11846406_25

  • DOI: https://doi.org/10.1007/11846406_25

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-39090-9

  • Online ISBN: 978-3-540-39091-6

  • eBook Packages: Computer Science, Computer Science (R0)
