Issues and Challenges in Developing Statistical POS Taggers for Sambalpuri

Behera, Pitambar; Ojha, Atul Kr.; Jha, Girish Nath

doi:10.1007/978-3-319-93782-3_28

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10930)

Included in the following conference series:

Language and Technology Conference

Abstract

Low-density languages are also known as lesser-known, poorly-described, less-resourced, minority or less-computerized languages because they have fewer resources available. Collecting and annotating a voluminous corpus for NLP applications in these languages proves quite challenging. Developing any NLP application for a low-density language requires an annotated corpus and a standard annotation scheme. Because of their non-standard usage in text and other linguistic nuances, these languages pose significant challenges that are both linguistic and technical in nature. The present paper highlights some of the underlying issues and challenges in developing statistical POS taggers with SVM and CRF++ for Sambalpuri, a less-resourced Eastern Indo-Aryan language. A corpus of approximately 121 k was collected from the web and converted into Unicode encoding. The whole corpus was annotated under the BIS (Bureau of Indian Standards) annotation scheme devised for Odia under the ILCI (Indian Languages Corpora Initiative) Project. Both taggers were trained and tested with approximately 80 k and 13 k respectively. The SVM tagger achieves 83% accuracy, while the CRF++ tagger achieves a lower 71.56%.
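
The accuracy figures quoted in the abstract can be illustrated with a minimal sketch. It assumes that accuracy means token-level accuracy (correctly tagged tokens divided by all tokens), which the abstract does not state explicitly. It also assumes a whitespace-separated layout of token, gold tag and predicted tag per line, with blank lines between sentences, similar to what CRF++'s crf_test writes. The function name tagging_accuracy and the placeholder tokens w1 to w4 are hypothetical; the tags are BIS-style labels used only for illustration.

```python
def tagging_accuracy(lines):
    """Token-level accuracy from rows of 'token gold_tag predicted_tag'.

    Assumption: the last two whitespace-separated columns are the gold
    and predicted tags; blank or malformed lines are skipped.
    """
    correct = 0
    total = 0
    for line in lines:
        cols = line.split()
        if len(cols) < 3:
            continue  # sentence boundary or incomplete row
        gold, predicted = cols[-2], cols[-1]
        total += 1
        if gold == predicted:
            correct += 1
    return correct / total if total else 0.0


# Hypothetical sample: placeholder tokens with BIS-style tags.
sample = [
    "w1 N_NN N_NN",
    "w2 V_VM V_VM",
    "w3 JJ N_NN",   # one tagging error
    "",             # sentence boundary
    "w4 PSP PSP",
]

accuracy = tagging_accuracy(sample)
print(f"{accuracy:.2%}")  # prints 75.00%
```

Under this reading, an accuracy of 83% on a test set of about 13 k would correspond to roughly 83 correct tags per 100 test tokens.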

Notes

1. https://www.ethnologue.com/language/spv.
2. However, there is no satisfactory explanation of the methodology adopted or of the number of lexical items analysed on the basis of which this conclusion was reached.
3. http://www.glossary.sil.org/term/agglutinative-language.
4. This is the very first POS tagset developed for Sambalpuri.
5. http://sanskrit.jnu.ac.in/ilciann/index.jsp.
6. https://22bc339da9ca3e2462414546a715752e4c2c5e0d.googledrive.com/host/0B5rBGd680WZFemVLa3RxY0preE0/AkrutiUnicode.

Author information

Authors and Affiliations

Centre for Linguistics, Jawaharlal Nehru University, New Delhi, India
Pitambar Behera & Girish Nath Jha
Special Centre for Sanskrit Studies, Jawaharlal Nehru University, New Delhi, India
Atul Kr. Ojha & Girish Nath Jha

Corresponding author

Correspondence to Pitambar Behera.

Editor information

Editors and Affiliations

Adam Mickiewicz University, Poznań, Poland
Zygmunt Vetulani
LIMSI-CNRS, Orsay Cedex, France
Joseph Mariani
Adam Mickiewicz University, Poznań, Poland
Marek Kubis

About this paper

Cite this paper

Behera, P., Ojha, A.K., Jha, G.N. (2018). Issues and Challenges in Developing Statistical POS Taggers for Sambalpuri. In: Vetulani, Z., Mariani, J., Kubis, M. (eds) Human Language Technology. Challenges for Computer Science and Linguistics. LTC 2015. Lecture Notes in Computer Science, vol 10930. Springer, Cham. https://doi.org/10.1007/978-3-319-93782-3_28

DOI: https://doi.org/10.1007/978-3-319-93782-3_28
Published: 16 June 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-93781-6
Online ISBN: 978-3-319-93782-3
eBook Packages: Computer Science, Computer Science (R0)
