Part-of-speech (POS) tagging using conditional random field (CRF) model for Khasi corpora

International Journal of Speech Technology

Abstract

Khasi belongs to the Mon-Khmer branch of the Austroasiatic language family and is spoken by the indigenous people of the state of Meghalaya in India. This paper presents work on part-of-speech (POS) tagging for the Khasi language using the Conditional Random Field (CRF) method. The main contribution of this work is to experiment with the CRF model for POS tagging in Khasi. This method shows reliable agreement with the features of the language. POS tagging for Khasi is essential for creating lemmatizers, which reduce a word to its root form, and the POS corpus can also be used in other NLP applications. In this research work, we have designed a tag set and a POS-tagged corpus. Because Khasi does not have a standard POS corpus, we built a Khasi corpus of around 71,000 tokens. After training the CRF model on the Khasi corpus, the system yields a testing accuracy of 92.12% and an F1-score of 0.91. The result is compared with a few other state-of-the-art techniques, and our approach produces promising results in comparison. In the future, we plan to increase the size of the Khasi POS corpus.
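The two reported figures, testing accuracy and F1-score, can be illustrated with a minimal sketch. The abstract does not say how the F1-score was averaged, so macro-averaging over tags is an assumption here, and the function name `accuracy_and_macro_f1` and the tags `N`, `V`, and `DET` are hypothetical placeholders rather than the paper's tag set or data.

```python
from collections import Counter


def accuracy_and_macro_f1(gold, pred):
    """Token-level accuracy and macro-averaged F1 for two equal-length tag lists.

    Assumption: macro-averaging over tags; the source does not state the scheme.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        raise ValueError("need at least one token")

    # Per-tag counts of true positives, false positives, and false negatives.
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # the predicted tag was wrong
            fn[g] += 1  # the gold tag was missed

    accuracy = sum(tp.values()) / len(gold)

    # F1 per tag is 2*TP / (2*TP + FP + FN); average over all tags seen.
    f1_scores = []
    for tag in set(gold) | set(pred):
        denom = 2 * tp[tag] + fp[tag] + fn[tag]
        f1_scores.append(2 * tp[tag] / denom if denom else 0.0)
    return accuracy, sum(f1_scores) / len(f1_scores)


# Hypothetical gold and predicted tags for a five-token test set.
gold = ["N", "V", "N", "DET", "N"]
pred = ["N", "V", "V", "DET", "N"]
acc, f1 = accuracy_and_macro_f1(gold, pred)
print(f"accuracy={acc:.2f} macro-F1={f1:.2f}")
```

Accuracy counts every token equally, while macro-F1 gives each tag the same weight regardless of frequency, so the two can diverge when rare tags are often mistagged.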

Acknowledgements

The authors would like to thank and acknowledge the Government of India, Ministry of Science & Technology, Department of Science & Technology (DST), KIRAN Division, Technology Bhavan, New Delhi, for financial assistance (Grant: DST/WOS-B/2018/1216/ETD/Sunita(G)) during the study.

Author information

Corresponding author

Correspondence to Arnab Kumar Maji.

About this article

Cite this article

Warjri, S., Pakray, P., Lyngdoh, S.A. et al. Part-of-speech (POS) tagging using conditional random field (CRF) model for Khasi corpora. Int J Speech Technol 24, 853–864 (2021). https://doi.org/10.1007/s10772-021-09860-w

  • DOI: https://doi.org/10.1007/s10772-021-09860-w
