Abstract
Stemming is the process of reducing a derived or inflected word to its root or stem by stripping all of its affixes. It is used as a pre-processing step in applications such as information retrieval, machine translation, and text summarization to increase efficiency. Stemming algorithms have been developed for several languages, including English, Arabic, Turkish, Malay and Amharic. Unfortunately, no algorithm has been used to stem text in Hausa, a Chadic language spoken in West Africa. To address this need, we propose stemming Hausa text using affix-stripping rules and reference look-up. We stemmed Hausa text using 78 affix-stripping rules applied in 4 steps and a reference look-up consisting of 1500 Hausa root words. The over-stemming index, under-stemming index, stemmer weight, word stemmed factor, correctly stemmed words factor and average words conflation factor were calculated to determine the effect of reference look-up on the strength and accuracy of the stemmer. It was observed that reference look-up helped reduce both over-stemming and under-stemming errors, increased accuracy, and tended to reduce the strength of an affix-stripping stemmer. The rationale behind the approach is discussed and directions for future research are identified.
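The combination of stepwise affix stripping with a root-word look-up can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the two-entry root list, the three suffix rules, the two steps, the minimum stem length, and the choice to consult the look-up before each step are placeholders, not the 78 rules, 4 steps, or 1500 roots used in the article, whose details the abstract does not give.

```python
# Toy data for illustration only: NOT the article's rules or root list.
ROOTS = {"gida", "rana"}

# Each step holds (kind, affix) rules tried in order. Two placeholder
# steps stand in for the article's four, which the abstract does not list.
STEPS = [
    [("suffix", "je"), ("suffix", "n")],
    [("suffix", "na")],
]

MIN_STEM_LEN = 2  # assumed guard against stripping a word down to nothing


def strip_affix(word, kind, affix):
    """Return word without the affix, or None if the rule does not apply."""
    if kind == "suffix" and word.endswith(affix):
        stem = word[: -len(affix)]
    elif kind == "prefix" and word.startswith(affix):
        stem = word[len(affix):]
    else:
        return None
    return stem if len(stem) >= MIN_STEM_LEN else None


def stem(word, roots=ROOTS, steps=STEPS, use_lookup=True):
    """Apply the steps in order; stop early once the word is a known root."""
    word = word.lower()
    for rules in steps:
        if use_lookup and word in roots:
            return word  # the look-up blocks further stripping
        for kind, affix in rules:
            candidate = strip_affix(word, kind, affix)
            if candidate is not None:
                word = candidate
                break  # apply at most one rule per step
    return word


print(stem("gidaje"))                  # gida
print(stem("rana"))                    # rana
print(stem("rana", use_lookup=False))  # ra
```

In this toy setup the look-up stops the second step from stripping "na" from a word that is already a listed root, which is the kind of over-stemming error the abstract reports the look-up reducing. Because it stops stripping early, it also makes the stemmer less aggressive, in line with the reported tendency toward lower stemmer strength.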

Acknowledgments
We gratefully acknowledge the support of Paul Newman, Indiana University USA, for the substantive comments and constructive criticisms given.
Cite this article
Bimba, A., Idris, N., Khamis, N. et al. Stemming Hausa text: using affix-stripping rules and reference look-up. Lang Resources & Evaluation 50, 687–703 (2016). https://doi.org/10.1007/s10579-015-9311-x
DOI: https://doi.org/10.1007/s10579-015-9311-x