research-article

Part-of-Speech (POS) Tagging Using Deep Learning-Based Approaches on the Designed Khasi POS Corpus

Published: 13 December 2021

Abstract

Part-of-speech (POS) tagging is one of the challenging research fields in natural language processing (NLP). It requires good knowledge of a particular language, along with large amounts of data or corpora for feature engineering, to achieve good tagger performance. Our main contribution in this research work is the designed Khasi POS corpus. To date, no Khasi corpus of any kind has been formally developed. In the designed Khasi POS corpus, each word is tagged manually using the designed tagset. Deep learning methods have been used to experiment with the corpus. POS taggers based on BiLSTM, on BiLSTM combined with CRF, and on character-based embedding with BiLSTM are presented. The main challenges encountered in understanding and handling natural language for computational linguistics are anticipated. In the designed corpus, we have tried to resolve the ambiguities of words with respect to their usage in context, as well as the orthography problems that arise in the corpus. The designed Khasi corpus contains around 96,100 tokens and 6,616 distinct words. In the initial experiments, run on the first sets of data of around 41,000 tokens, the taggers were found to yield considerably accurate results. When the corpus size was increased to 96,100 tokens, the accuracy rate increased and the analyses became more pertinent. An accuracy of 96.81% is achieved for the BiLSTM method, 96.98% for BiLSTM with CRF, and 95.86% for the character-based approach with LSTM. To place this work in the context of NLP research on Khasi, we also present some recent POS taggers and other NLP work on the Khasi language for comparison.
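The corpus figures and accuracy rates quoted above can be illustrated with a minimal Python sketch. It is not the authors' evaluation code: it assumes the corpus is stored as whitespace-separated `word/TAG` items and that the reported accuracy is token-level (correctly tagged tokens divided by all tokens). The function names, the placeholder words (`w1`, `w2`, ...), and the tag labels are hypothetical and are not taken from the Khasi tagset.

```python
def parse_tagged(line, sep="/"):
    """Split a 'word/TAG word/TAG ...' line into (word, tag) pairs.

    The 'word/TAG' storage format is an assumption made for this sketch.
    """
    pairs = []
    for item in line.split():
        word, _, tag = item.rpartition(sep)
        pairs.append((word, tag))
    return pairs


def corpus_stats(sentences):
    """Return (number of tokens, number of distinct words)."""
    words = [word for sent in sentences for word, _ in sent]
    return len(words), len(set(words))


def token_accuracy(gold, predicted):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    total = 0
    correct = 0
    for gold_sent, pred_sent in zip(gold, predicted):
        if len(gold_sent) != len(pred_sent):
            raise ValueError("gold and predicted sentences differ in length")
        for gold_tag, pred_tag in zip(gold_sent, pred_sent):
            total += 1
            correct += int(gold_tag == pred_tag)
    return correct / total if total else 0.0


# Toy data with placeholder words and hypothetical tags (not real Khasi).
raw_lines = ["w1/PRON w2/VERB w3/NOUN", "w1/PRON w4/VERB"]
gold_sents = [parse_tagged(line) for line in raw_lines]

n_tokens, n_types = corpus_stats(gold_sents)

gold_tags = [[tag for _, tag in sent] for sent in gold_sents]
pred_tags = [["PRON", "VERB", "NOUN"], ["PRON", "NOUN"]]  # one wrong tag
accuracy = token_accuracy(gold_tags, pred_tags)

print(n_tokens, n_types, accuracy)
```

On the toy data this counts 5 tokens and 4 distinct words, and one wrong tag out of five gives an accuracy of 0.8. Under the same assumptions, the corpus described above would give roughly 96,100 tokens and 6,616 distinct words.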



        Published in

          ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 21, Issue 3
          May 2022
          413 pages
          ISSN: 2375-4699
          EISSN: 2375-4702
          DOI: 10.1145/3505182


          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 13 December 2021
          • Accepted: 1 September 2021
          • Revised: 1 June 2021
          • Received: 1 September 2019

          Qualifiers

          • research-article
          • Refereed
