Data-Driven Part-of-Speech Tagging of Kiswahili

De Pauw, Guy; de Schryver, Gilles-Maurice; Wagacha, Peter W.

doi:10.1007/11846406_25

Guy De Pauw, Gilles-Maurice de Schryver & Peter W. Wagacha

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4188)

Included in the following conference series:

International Conference on Text, Speech and Dialogue

Abstract

In this paper we present experiments with data-driven part-of-speech taggers trained and evaluated on the annotated Helsinki Corpus of Swahili. Using four current state-of-the-art data-driven taggers, TnT, MBT, SVMTool and MXPOST, we find the last of these to be the most accurate tagger for the Kiswahili dataset. We further improve on the performance of the individual taggers by combining them into a committee of taggers. We observe that the more naive combination methods, like the novel plural voting approach, outperform more elaborate schemes like cascaded classifiers and weighted voting. This paper is the first publication to present experiments on data-driven part-of-speech tagging for Kiswahili and Bantu languages in general.
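
To make the idea of a committee of taggers concrete, the following is a minimal sketch of two generic combination schemes: simple majority voting and weighted voting. It is not the paper's implementation, and it does not reproduce the plural voting approach, whose details are not given in the abstract. The tag labels, the toy predictions, the weights and the tie-breaking order are all invented for illustration; only the four tagger names come from the abstract.

```python
from collections import Counter, defaultdict


def majority_vote(predictions, priority):
    """Combine per-token tags by simple majority voting.

    predictions: dict mapping tagger name -> list of tags (equal lengths).
    priority: tagger names in tie-breaking order (an assumption of this
              sketch, not something specified in the abstract).
    """
    n_tokens = len(next(iter(predictions.values())))
    combined = []
    for i in range(n_tokens):
        votes = Counter(tags[i] for tags in predictions.values())
        top = max(votes.values())
        winners = {tag for tag, count in votes.items() if count == top}
        # On a tie, take the tag proposed by the highest-priority tagger
        # whose proposal is among the tied winners.
        for name in priority:
            if predictions[name][i] in winners:
                combined.append(predictions[name][i])
                break
    return combined


def weighted_vote(predictions, weights):
    """Combine per-token tags, letting each tagger vote with its weight."""
    n_tokens = len(next(iter(predictions.values())))
    combined = []
    for i in range(n_tokens):
        scores = defaultdict(int)
        for name, tags in predictions.items():
            scores[tags[i]] += weights[name]
        combined.append(max(scores, key=scores.get))
    return combined


# Toy data: three tokens, four taggers. Tags and outputs are made up.
toy_predictions = {
    "TnT":     ["N",    "V", "N"],
    "MBT":     ["N",    "V", "ADJ"],
    "SVMTool": ["PRON", "V", "ADJ"],
    "MXPOST":  ["PRON", "V", "N"],
}
# Hypothetical tie-breaking order and weights, not results from the paper.
toy_priority = ["MXPOST", "SVMTool", "TnT", "MBT"]
toy_weights = {"TnT": 3, "MBT": 3, "SVMTool": 1, "MXPOST": 4}

print(majority_vote(toy_predictions, toy_priority))
print(weighted_vote(toy_predictions, toy_weights))
```

In this toy example the two schemes disagree on the first token, where the taggers split two against two: majority voting falls back on the tie-breaking order, while weighted voting lets the summed weights decide.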

Author information

Authors and Affiliations

CNTS – Language Technology Group, University of Antwerp, Belgium
Guy De Pauw
African Languages and Cultures, Ghent University, Belgium
Gilles-Maurice de Schryver
Xhosa Department, University of the Western Cape, South Africa
Gilles-Maurice de Schryver
School of Computing and Informatics, University of Nairobi, Kenya
Peter W. Wagacha

Editor information

Editors and Affiliations

Faculty of Informatics, Masaryk University, Brno, Czech Republic
Petr Sojka
Faculty of Informatics, Masaryk University, Botanická 68a, CZ-602 00, Brno, Czech Republic
Ivan Kopeček
Faculty of Informatics, Department of Computer Graphics and Design, Masaryk University, Botanická 68a, 60200, Brno, Czech Republic
Karel Pala

About this paper

Cite this paper

De Pauw, G., de Schryver, GM., Wagacha, P.W. (2006). Data-Driven Part-of-Speech Tagging of Kiswahili. In: Sojka, P., Kopeček, I., Pala, K. (eds) Text, Speech and Dialogue. TSD 2006. Lecture Notes in Computer Science, vol 4188. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11846406_25

DOI: https://doi.org/10.1007/11846406_25
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-39090-9
Online ISBN: 978-3-540-39091-6
eBook Packages: Computer Science, Computer Science (R0)
