Abstract
Part-of-speech (POS) tagging is one of the challenging research fields in natural language processing (NLP). It requires good knowledge of the particular language and large amounts of data, or corpora, for feature engineering, which in turn lead to good tagger performance. Our main contribution in this work is the design of a Khasi POS corpus. To date, no Khasi corpus of any kind has been formally developed. In the designed Khasi POS corpus, each word is tagged manually using the designed tagset. Deep learning methods are used to experiment with the corpus. We present POS taggers based on BiLSTM, on BiLSTM combined with CRF, and on character-based embedding with BiLSTM. We also anticipate the main challenges to be encountered in understanding and handling natural language from a computational linguistics perspective. In the corpus, we have tried to resolve the ambiguities of words with respect to their context of use, as well as the orthography problems that arise. The corpus contains around 96,100 tokens and 6,616 distinct words. In initial experiments on the first sets of data, around 41,000 tokens, the taggers already yielded considerably accurate results. When the corpus was increased to 96,100 tokens, accuracy increased and the analyses became more pertinent. The accuracy achieved is 96.81% for the BiLSTM method, 96.98% for BiLSTM with CRF, and 95.86% for the character-based LSTM model. To situate this work within NLP research on Khasi, we also present some recent POS taggers and other NLP works on the Khasi language for comparison.
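The corpus figures (tokens, distinct words) and the per-tagger accuracies quoted above are simple counts over tagged data. The following is a minimal sketch of how such numbers might be computed, assuming the corpus is held as a list of sentences of (word, tag) pairs. The function names, the placeholder words `w1` to `w3`, and the tag labels are hypothetical and are not taken from the Khasi corpus or its tagset.

```python
def corpus_stats(tagged_sentences):
    """Return (token_count, distinct_word_count) for a tagged corpus."""
    words = [word for sentence in tagged_sentences for word, _tag in sentence]
    return len(words), len(set(words))


def token_accuracy(gold_sentences, predicted_tags):
    """Fraction of tokens whose predicted tag equals the manually assigned tag."""
    correct = 0
    total = 0
    for gold, pred in zip(gold_sentences, predicted_tags):
        if len(gold) != len(pred):
            raise ValueError("gold and predicted sentences differ in length")
        for (_word, gold_tag), pred_tag in zip(gold, pred):
            correct += int(gold_tag == pred_tag)
            total += 1
    return correct / total if total else 0.0


# Hypothetical toy data: placeholder words and tags, not real Khasi annotations.
toy_gold = [
    [("w1", "NOUN"), ("w2", "VERB"), ("w1", "NOUN")],
    [("w3", "DET"), ("w1", "NOUN")],
]
# Hypothetical tagger output, one tag per token, with one error in sentence 1.
toy_pred = [
    ["NOUN", "VERB", "VERB"],
    ["DET", "NOUN"],
]

tokens, distinct = corpus_stats(toy_gold)
accuracy = token_accuracy(toy_gold, toy_pred)
print(f"tokens={tokens} distinct={distinct} accuracy={accuracy:.2%}")
```

Accuracy here is measured per token rather than per sentence, which is the usual convention for POS tagging; whether the reported figures were computed on a held-out split in exactly this way is an assumption.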
Index Terms
- Part-of-Speech (POS) Tagging Using Deep Learning-Based Approaches on the Designed Khasi POS Corpus