Annotating Sanskrit Corpus: Adapting IL-POSTS

Jha, Girish Nath; Gopal, Madhav; Mishra, Diwakar

doi:10.1007/978-3-642-20095-3_34

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 6562)

Abstract

In this paper we present an experiment in using the hierarchical Indic Languages POS Tagset (IL-POSTS) (Baskaran et al. 2008a, 2008b), developed by Microsoft Research India (MSRI) for tagging Indian languages, to annotate a Sanskrit corpus. Sanskrit has rich morphology and relatively free word order. We have included some tags and excluded others according to the requirements of the Sanskrit data. We also present a revision of the IL-POSTS annotation guidelines, and report an experiment in training the tagger at MSRI together with its results.

References

AU-KBC tagset. AU-KBC POS tagset for Tamil, http://nrcfosshelpline.in/smedia/images/downloads/Tamil_Tagset-opensource.odt
Baskaran, S., Bali, K., Bhattacharya, T., Bhattacharyya, P., Choudhury, M., Jha, G.N., Rajendran, S., Saravanan, K., Sobha, L., Subbarao, K.V.S.: A Common Parts-of-Speech Tagset Framework for Indian Languages. In: LREC 2008 - 6th Language Resources and Evaluation Conference, Marrakech, Morocco, May 26-June 1 (2008)
Baskaran, S., Bali, K., Bhattacharya, T., Bhattacharyya, P., Choudhury, M., Jha, G.N., Rajendran, S., Saravanan, K., Sobha, L., Subbarao, K.V.S.: Designing a Common POS-Tagset Framework for Indian Languages. In: The 6th Workshop on Asian Language Resources, Hyderabad (January 2008)
Cardona, G.: Pāṇini: His work and its traditions. Motilal Banarsidass, Delhi (1988)
Chandrashekar, R.: POS Tagger for Sanskrit, Ph.D. thesis, Jawaharlal Nehru University (2007)
Cloeren, J.: Tagsets. In: van Halteren, H. (ed.) Syntactic Wordclass Tagging. Kluwer Academic, Dordrecht (1999)
Jha, G.N.: Generating nominal inflectional morphology in Sanskrit. In: SIMPLE 2004, IIT-Kharagpur Lecture Compendium, Shyama Printing Works, Kharagpur, WB (2004)
Jha, G.N., Sobha, L., Mishra, D., Singh, S.K., Pralayankar, P.: Anaphors in Sanskrit. In: Johansson, C. (ed.) Proceedings of the Second Workshop on Anaphora Resolution (2008), vol. 2, Cambridge Scholars Publishing (2007) ISSN 1736-6305
Jha, G.N., Mishra, S.K.: Semantic processing in Panini’s karaka system. In: Huet, G., Kulkarni, A., Scharf, P. (eds.) Sanskrit Computational Linguistics. LNCS, vol. 5402, pp. 239–252. Springer, Heidelberg (2009)
Greene, B.B., Rubin, G.M.: Automatic grammatical tagging of English. Department of Linguistics, Brown University, Providence, R.I (1981)
Hardie, A.: The Computational Analysis of Morphosyntactic Categories in Urdu. PhD Thesis submitted to Lancaster University (2004)
IIIT-Tagset. A Parts-of-Speech tagset for Indian Languages, http://shiva.iiit.ac.in/SPSAL2007/iiit_tagset_guidelines.pdf
Huet, G.: The Sanskrit Heritage Site, http://sanskrit.inria.fr/
Kale, M.R.: A Higher Sanskrit Grammar. MLBD Publishers, New Delhi (1995)
Leech, G., Wilson, A.: Recommendations for the Morphosyntactic Annotation of Corpora. EAGLES Report EAG-TCWG-MAC/R (1996)
Leech, G., Wilson, A.: Standards for Tag-sets. In: van Halteren, H. (ed.) Syntactic Wordclass Tagging. Kluwer Academic, Dordrecht (1999)
Leech, G.: Grammatical Tagging. In: Garside, R., Leech, G., McEnery, A. (eds.) Corpus Annotation: Linguistic Information for Computer Text Corpora. Longman, London (1997)
Sudhir, M., Jha, G.N.: Identifying verb inflections in Sanskrit morphology. In: Proceedings of SIMPLE 2004, IIT Kharagpur (2005)
NLPAI Contest-2006, http://ltrs.iiit.ac.in/nlpai_cntest06
Hellwig, O.: A Stochastic Lexical and POS Tagger for Sanskrit. In: Huet, G., Kulkarni, A., Scharf, P. (eds.) Sanskrit Computational Linguistics. LNCS, vol. 5402, pp. 266–277. Springer, Heidelberg (2009)
Ramkrishnamacharyulu, K.V.: Annotating Sanskrit Texts Based on Sabdabodha Systems. In: Kulkarni, A., Huet, G. (eds.) Sanskrit Computational Linguistics. LNCS (LNAI), vol. 5406, pp. 26–39. Springer, Heidelberg (2009)
Santorini, B.: Part-of-speech tagging guidelines for the Penn Treebank Project. Technical report MS-CIS-90-47, Dept. of Computer and Information Science, University of Pennsylvania (1990)
Subash, C.: Sanskrit Subanta Recognizer and Analyzer, M.Phil dissertation submitted to Jawaharlal Nehru University (2006)

Author information

Authors and Affiliations

Special Center for Sanskrit Studies, J.N.U., New Delhi, 110067, India
Girish Nath Jha & Diwakar Mishra
Center of Linguistics, School of Language Literature & Culture Studies, JNU, New Delhi, 110067, India
Madhav Gopal

Editor information

Editors and Affiliations

Faculty of Mathematics and Computer Science, Adam Mickiewicz University in Poznan, ul. Umultowska 87, 61614, Poznan, Poland
Zygmunt Vetulani

About this paper

Cite this paper

Jha, G.N., Gopal, M., Mishra, D. (2011). Annotating Sanskrit Corpus: Adapting IL-POSTS. In: Vetulani, Z. (ed.) Human Language Technology. Challenges for Computer Science and Linguistics. LTC 2009. Lecture Notes in Computer Science (LNAI), vol 6562. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20095-3_34

DOI: https://doi.org/10.1007/978-3-642-20095-3_34
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-20094-6
Online ISBN: 978-3-642-20095-3
eBook Packages: Computer Science, Computer Science (R0)
