Abstract
Kannada is a phonetic language. In Kannada, the morphological forms of a term (especially of nouns and verbs) are formed by adding morphological suffixes to its base (pure) form. Consequently, when queried with one morphological form, search engines based on exact matching fail to identify terms that are semantically similar but morphologically different, which reduces the quality of the search results. We observe that even though the morphological forms of a term look different, they can be grouped together by their common prefixes. In this work we propose indexing and retrieval algorithms based on fuzzy matching. The indexing mechanism is inspired by prefix trees, and also by the fact that Unicode encodes Kannada terms in much the same way that Kannada grammar generates them. We also discuss a retrieval algorithm based on query term truncation and decayed scores, intended to improve document retrieval for a given query. The indexing and retrieval systems are still based on tf-idf; the novelty of the work lies in the way the algorithms bring similar terms together. This solution can be extended to other South Indian languages with little or no modification, as their Unicode encoding and morphological behavior are similar to those of Kannada.
This work is part of the Kanaja project, conceptualised by Karnataka Jnana Ayoga.
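The abstract does not spell out the algorithms, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes whitespace tokenization, a trie keyed on individual Unicode code points, truncation that drops one trailing code point at a time, and a multiplicative decay per dropped code point. The names `PrefixIndex`, `decay`, and `min_prefix_len` are hypothetical, as is the choice to compute tf and idf over the whole group of terms sharing a prefix.

```python
import math
from collections import defaultdict


class TrieNode:
    def __init__(self):
        self.children = {}                # code point -> TrieNode
        self.postings = defaultdict(int)  # doc_id -> tf of the term ending here


class PrefixIndex:
    """Hypothetical prefix-tree index over Unicode code points."""

    def __init__(self):
        self.root = TrieNode()
        self.doc_ids = set()

    def add_document(self, doc_id, text):
        self.doc_ids.add(doc_id)
        for term in text.split():  # assumption: whitespace tokenization
            node = self.root
            for ch in term:
                node = node.children.setdefault(ch, TrieNode())
            node.postings[doc_id] += 1

    def _find(self, prefix):
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def _collect(self, node):
        # Sum term frequencies per document over every term below this node.
        totals = defaultdict(int)
        stack = [node]
        while stack:
            current = stack.pop()
            for doc_id, tf in current.postings.items():
                totals[doc_id] += tf
            stack.extend(current.children.values())
        return totals

    def search(self, query_term, decay=0.5, min_prefix_len=2):
        # Truncate the query one code point at a time; each truncation
        # multiplies the tf-idf style score by `decay` (assumed scheme).
        scores = {}
        n_docs = len(self.doc_ids)
        for dropped in range(len(query_term) - min_prefix_len + 1):
            prefix = query_term[:len(query_term) - dropped]
            node = self._find(prefix)
            if node is None:
                continue
            totals = self._collect(node)
            idf = math.log(1 + n_docs / len(totals))
            weight = decay ** dropped
            for doc_id, tf in totals.items():
                score = weight * tf * idf
                scores[doc_id] = max(scores.get(doc_id, 0.0), score)
        return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


index = PrefixIndex()
index.add_document("d1", "ಮನೆ")        # "house"
index.add_document("d2", "ಮನೆಯಲ್ಲಿ")    # "in the house"
index.add_document("d3", "ಮರ")         # "tree"
results = index.search("ಮನೆಗೆ")        # "to the house"
```

In this toy example the query form itself is not indexed, but truncating it to the shared prefix "ಮನೆ" retrieves `d1` and `d2` with a decayed score, while `d3` is not returned. Keeping the maximum score per document means a document found through a longer prefix is not penalised again when it reappears under a shorter one.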
Copyright information
© 2013 Springer International Publishing Switzerland
About this paper
Cite this paper
Kulkarni, S., Srinivasa, S. (2013). TrieIR: Indexing and Retrieval Engine for Kannada Unicode Text. In: Urs, S.R., Na, JC., Buchanan, G. (eds) Digital Libraries: Social Media and Community Networks. ICADL 2013. Lecture Notes in Computer Science, vol 8279. Springer, Cham. https://doi.org/10.1007/978-3-319-03599-4_3
DOI: https://doi.org/10.1007/978-3-319-03599-4_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-03598-7
Online ISBN: 978-3-319-03599-4
eBook Packages: Computer Science (R0)