Abstract
In this paper we present an experiment in using the hierarchical Indic Languages POS Tagset (IL-POSTS) (Baskaran et al. 2008a, 2008b), developed by Microsoft Research India (MSRI) for tagging Indian languages, to annotate a Sanskrit corpus. Sanskrit is a language with rich morphology and relatively free word order. The authors added some tags and dropped others to fit the requirements of the Sanskrit data. A revision of the IL-POSTS annotation guidelines is also presented. The authors also report an experiment in training the tagger at MSRI and document its results.
References
AU-KBC tagset. AU-KBC POS tagset for Tamil, http://nrcfosshelpline.in/smedia/images/downloads/Tamil_Tagset-opensource.odt
Baskaran, S., Bali, K., Bhattacharya, T., Bhattacharyya, P., Choudhury, M., Jha, G.N., Rajendran, S., Saravanan, K., Sobha, L., Subbarao, K.V.S.: A Common Parts-of-Speech Tagset Framework for Indian Languages. In: LREC 2008 - 6th Language Resources and Evaluation Conference, Marrakech, Morocco, May 26-June 1 (2008)
Baskaran, S., Bali, K., Bhattacharya, T., Bhattacharyya, P., Choudhury, M., Jha, G.N., Rajendran, S., Saravanan, K., Sobha, L., Subbarao, K.V.S.: Designing a Common POS-Tagset Framework for Indian Languages. In: The 6th Workshop on Asian Language Resources, Hyderabad (January 2008)
Cardona, G.: Pāṇini: His work and its traditions, Motilal Banarsidass, Delhi (1988)
Chandrashekar, R.: POS Tagger for Sanskrit, Ph.D. thesis, Jawaharlal Nehru University (2007)
Cloeren, J.: Tagsets. In: van Halteren, H. (ed.) Syntactic Wordclass Tagging. Kluwer Academic, Dordrecht (1999)
Jha, G.N.: Generating nominal inflectional morphology in Sanskrit. In: SIMPLE 2004, IIT-Kharagpur Lecture Compendium, Shyama Printing Works, Kharagpur, WB (2004)
Jha, G.N., Sobha, L., Mishra, D., Singh, S.K., Pralayankar, P.: Anaphors in Sanskrit. In: Johansson, C. (ed.) Proceedings of the Second Workshop on Anaphora Resolution (2008), vol. 2, Cambridge Scholars Publishing (2007) ISSN 1736-6305
Jha, G.N., Mishra, S.K.: Semantic processing in Panini’s karaka system. In: Huet, G., Kulkarni, A., Scharf, P. (eds.) Sanskrit Computational Linguistics. LNCS, vol. 5402, pp. 239–252. Springer, Heidelberg (2009)
Greene, B.B., Rubin, G.M.: Automatic grammatical tagging of English. Department of Linguistics, Brown University, Providence, R.I (1981)
Hardie, A.: The Computational Analysis of Morphosyntactic Categories in Urdu. PhD Thesis submitted to Lancaster University (2004)
IIIT-Tagset. A Parts-of-Speech tagset for Indian Languages, http://shiva.iiit.ac.in/SPSAL2007/iiit_tagset_guidelines.pdf
Huet, G.: The Sanskrit Heritage Site, http://sanskrit.inria.fr/
Kale, M.R.: A Higher Sanskrit Grammar. MLBD Publishers, New Delhi (1995)
Leech, G., Wilson, A.: Recommendations for the Morphosyntactic Annotation of Corpora. EAGLES Report EAG-TCWG-MAC/R (1996)
Leech, G., Wilson, A.: Standards for Tag-sets. In: van Halteren, H. (ed.) Syntactic Wordclass Tagging. Kluwer Academic, Dordrecht (1999)
Leech, G.: Grammatical Tagging. In: Garside, R., Leech, G., McEnery, A. (eds.) Corpus Annotation: Linguistic Information for Computer Text Corpora. Longman, London (1997)
Mishra, S.K., Jha, G.N.: Identifying verb inflections in Sanskrit morphology. In: Proceedings of SIMPLE 2004, IIT Kharagpur (2005)
NLPAI Contest-2006, http://ltrs.iiit.ac.in/nlpai_cntest06
Hellwig, O.: A Stochastic Lexical and POS Tagger for Sanskrit. In: Huet, G., Kulkarni, A., Scharf, P. (eds.) Sanskrit Computational Linguistics. LNCS, vol. 5402, pp. 266–277. Springer, Heidelberg (2009)
Ramkrishnamacharyulu, K.V.: Annotating Sanskrit Texts Based on Sabdabodha Systems. In: Kulkarni, A., Huet, G. (eds.) Sanskrit Computational Linguistics. LNCS (LNAI), vol. 5406, pp. 26–39. Springer, Heidelberg (2009)
Santorini, B.: Part-of-speech tagging guidelines for the Penn Treebank Project. Technical report MS-CIS-90-47, Dept. of Computer and Information Science, University of Pennsylvania (1990)
Subash, C.: Sanskrit Subanta Recognizer and Analyzer, M.Phil dissertation submitted to Jawaharlal Nehru University (2006)
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Jha, G.N., Gopal, M., Mishra, D. (2011). Annotating Sanskrit Corpus: Adapting IL-POSTS. In: Vetulani, Z. (ed.) Human Language Technology. Challenges for Computer Science and Linguistics. LTC 2009. Lecture Notes in Computer Science, vol 6562. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20095-3_34
DOI: https://doi.org/10.1007/978-3-642-20095-3_34
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-20094-6
Online ISBN: 978-3-642-20095-3
eBook Packages: Computer Science, Computer Science (R0)