Language Identification for South African Bantu Languages Using Rank Order Statistics

Dube, Meluleki; Suleman, Hussein

doi:10.1007/978-3-030-34058-2_26

Conference paper
First Online: 29 October 2019

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11853)

Abstract

Language identification is an important pre-processing step in many data management, information retrieval and transformation systems. However, Bantu languages are known to be difficult to identify because of a lack of data and the similarity between the languages. This paper investigates the performance of n-gram counting using rank orders to discriminate among the different Bantu languages spoken in South Africa, using varying test and training data sizes. The highest average accuracy obtained was 99.3%, with a testing size of 495 characters and a training size of 600,000 characters. The lowest average accuracy obtained was 78.72%, with a testing size of 15 characters and a training size of 200,000 characters.
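The abstract does not spell out the procedure, so the following is only a minimal sketch of one common form of rank-order n-gram identification, in which a text's most frequent character n-grams are ranked and compared with each language's ranked profile using an "out-of-place" distance. The function names, the n-gram lengths (1 to 3), the profile size (300) and the toy training strings are all assumptions made for illustration and are not taken from the paper.

```python
from collections import Counter


def ngram_profile(text, n_max=3, top_k=300):
    """Map each of the top_k most frequent character n-grams to its rank (0 = most frequent)."""
    cleaned = " ".join(text.lower().split())  # normalise case and whitespace
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(cleaned) - n + 1):
            counts[cleaned[i:i + n]] += 1
    # Sort by descending frequency; break ties alphabetically so ranks are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
    return {gram: rank for rank, (gram, _) in enumerate(ranked)}


def out_of_place(doc_profile, lang_profile, max_penalty):
    """Sum of rank differences; n-grams missing from the language profile cost max_penalty."""
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else max_penalty
        for gram, rank in doc_profile.items()
    )


def identify(text, lang_profiles, n_max=3, top_k=300):
    """Return the label of the language profile closest to the text's profile."""
    doc = ngram_profile(text, n_max, top_k)
    return min(
        lang_profiles,
        key=lambda lang: out_of_place(doc, lang_profiles[lang], top_k),
    )


# Hypothetical placeholder "training" strings, not real language data.
profiles = {
    "lang_a": ngram_profile("ababab abab ab"),
    "lang_b": ngram_profile("xyzxyz xyz xy"),
}
print(identify("abab", profiles))
```

In a setup like the one the abstract describes, each profile would be built from a training text of a chosen size (for example 200,000 or 600,000 characters) and the test string would be a short excerpt (for example 15 or 495 characters); shorter test strings yield fewer n-grams to rank, which is consistent with the lower accuracy reported for them.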

Acknowledgements

This research was partially funded by the National Research Foundation of South Africa (Grant numbers: 85470 and 105862) and the University of Cape Town. The authors acknowledge that opinions, findings and conclusions or recommendations expressed in this publication are those of the authors, and that the NRF accepts no liability whatsoever in this regard.

Author information

Authors and Affiliations

University of Cape Town, Cape Town, South Africa
Meluleki Dube & Hussein Suleman

Authors

Meluleki Dube
Hussein Suleman

Corresponding author

Correspondence to Hussein Suleman.

Editor information

Editors and Affiliations

Kyoto University, Kyoto, Japan
Adam Jatowt
Ritsumeikan University, Kusatsu, Japan
Akira Maeda
The Catholic University of America, Washington, DC, USA
Sue Yeon Syn

About this paper

Cite this paper

Dube, M., Suleman, H. (2019). Language Identification for South African Bantu Languages Using Rank Order Statistics. In: Jatowt, A., Maeda, A., Syn, S. (eds) Digital Libraries at the Crossroads of Digital Information for the Future. ICADL 2019. Lecture Notes in Computer Science, vol 11853. Springer, Cham. https://doi.org/10.1007/978-3-030-34058-2_26

DOI: https://doi.org/10.1007/978-3-030-34058-2_26
Published: 29 October 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-34057-5
Online ISBN: 978-3-030-34058-2
eBook Packages: Computer Science, Computer Science (R0)
