Part-of-Speech (POS) Tagging Using Deep Learning-Based Approaches on the Designed Khasi POS Corpus

Published: 13 December 2021

Abstract

Part-of-speech (POS) tagging is one of the challenging research fields in natural language processing (NLP). It requires good knowledge of the particular language and large amounts of data or corpora for feature engineering, which together can lead to good tagger performance. Our main contribution in this work is the designed Khasi POS corpus. To date, no Khasi corpus of any kind has been formally developed. In the designed Khasi POS corpus, each word is tagged manually using the designed tagset. Deep learning methods are used to experiment with the corpus. We present POS taggers based on BiLSTM, on BiLSTM combined with CRF, and on character-based embedding with BiLSTM. The main challenges encountered in understanding and handling natural language from a computational linguistics perspective are anticipated. In the designed corpus, we have tried to resolve the ambiguity of words with respect to their contextual usage, as well as the orthography problems that arise in the POS corpus. The designed Khasi corpus contains around 96,100 tokens and 6,616 distinct words. In the initial experiments, run on the first sets of data of around 41,000 tokens, the taggers yielded considerably accurate results. When the corpus size was increased to 96,100 tokens, the accuracy rate increased and the analyses became more pertinent. The resulting accuracy is 96.81% for the BiLSTM method, 96.98% for BiLSTM with CRF, and 95.86% for the character-based method with LSTM. To place this work within NLP research on Khasi, we also present some of the recently existing POS taggers and other NLP works on the Khasi language for comparison.
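The corpus figures and accuracy rates quoted in the abstract (token count, distinct-word count, per-token tagging accuracy) can be illustrated with a minimal Python sketch. This is not the authors' code. It assumes a hypothetical `word/TAG` line format, and the tokens (`w1`, `w2`, ...) and tags (`T1`, `T2`) below are placeholders, not entries from the Khasi corpus or its tagset.

```python
def parse_tagged_line(line, sep="/"):
    """Split a 'word/TAG word/TAG ...' line into (word, tag) pairs.

    The 'word/TAG' layout is an assumption made for this sketch.
    """
    pairs = []
    for item in line.split():
        word, found, tag = item.rpartition(sep)
        if not found or not word or not tag:
            raise ValueError(f"malformed token: {item!r}")
        pairs.append((word, tag))
    return pairs


def corpus_stats(sentences):
    """Return (number of tokens, number of distinct words)."""
    words = [word for sent in sentences for word, _ in sent]
    return len(words), len(set(words))


def token_accuracy(gold_tags, predicted_tags):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    total = 0
    correct = 0
    for gold_sent, pred_sent in zip(gold_tags, predicted_tags):
        if len(gold_sent) != len(pred_sent):
            raise ValueError("gold and predicted sentences differ in length")
        for gold, pred in zip(gold_sent, pred_sent):
            total += 1
            correct += int(gold == pred)
    return correct / total if total else 0.0


# Placeholder data: two manually "tagged" sentences.
lines = ["w1/T1 w2/T2 w1/T1", "w3/T2 w2/T2"]
sentences = [parse_tagged_line(line) for line in lines]
n_tokens, n_distinct = corpus_stats(sentences)

# Gold tags from the corpus versus tags from a hypothetical tagger.
gold = [[tag for _, tag in sent] for sent in sentences]
predicted = [["T1", "T2", "T2"], ["T2", "T2"]]
accuracy = token_accuracy(gold, predicted)

print(n_tokens, n_distinct, accuracy)
```

On this toy input the sketch counts 5 tokens and 3 distinct words, and the hypothetical tagger gets 4 of 5 tags right. Whether the reported accuracies were computed per token over a held-out split in exactly this way is not stated in the abstract.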

        Published In

        ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 21, Issue 3
        May 2022
        413 pages
        ISSN:2375-4699
        EISSN:2375-4702
        DOI:10.1145/3505182
        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        Published: 13 December 2021
        Accepted: 01 September 2021
        Revised: 01 June 2021
        Received: 01 September 2019
        Published in TALLIP Volume 21, Issue 3

        Author Tags

        1. Deep learning
        2. BiLSTM
        3. word embedding
        4. POS tagger
        5. ambiguity
        6. Khasi language
        7. Khasi corpus

        Qualifiers

        • Research-article
        • Refereed

        Cited By

        • (2024) A conditional random field based approach for high-accuracy part-of-speech tagging using language-independent features. PeerJ Computer Science 10 (e2577). DOI: 10.7717/peerj-cs.2577. Online publication date: 11-Dec-2024
        • (2024) Abusive Language Detection in Khasi Social Media Comments. ACM Transactions on Asian and Low-Resource Language Information Processing. DOI: 10.1145/3664285. Online publication date: 14-May-2024
        • (2024) Part-of-speech Tagging for Low-resource Languages: Activation Function for Deep Learning Network to Work with Minimal Training Data. ACM Transactions on Asian and Low-Resource Language Information Processing 23:5 (1-31). DOI: 10.1145/3655023. Online publication date: 10-May-2024
        • (2024) Leveraging Bidirectionl LSTM with CRFs for Pashto Tagging. ACM Transactions on Asian and Low-Resource Language Information Processing 23:4 (1-17). DOI: 10.1145/3649456. Online publication date: 15-Apr-2024
        • (2024) Deep Learning-based POS Tagger and Chunker for Odia Language Using Pre-trained Transformers. ACM Transactions on Asian and Low-Resource Language Information Processing 23:2 (1-23). DOI: 10.1145/3637877. Online publication date: 8-Feb-2024
        • (2024) A System for Classifying Kazakh Language Documents: Morphological Analysis and Automatic Keyword Identification. 2024 IEEE 3rd International Conference on Problems of Informatics, Electronics and Radio Engineering (PIERE) (1780-1786). DOI: 10.1109/PIERE62470.2024.10804927. Online publication date: 15-Nov-2024
        • (2023) Enhancing HMM-based POS tagger for Mizo language. Journal of Intelligent & Fuzzy Systems: Applications in Engineering and Technology 45:6 (11725-11736). DOI: 10.3233/JIFS-224220. Online publication date: 2-Dec-2023
        • (2023) Part-of-Speech Tagging of Odia Language Using Statistical and Deep Learning Based Approaches. ACM Transactions on Asian and Low-Resource Language Information Processing 22:6 (1-24). DOI: 10.1145/3588900. Online publication date: 16-Jun-2023
        • (2023) Fake news detection using social media data for Khasi language. 2023 International Conference on Intelligent Systems, Advanced Computing and Communication (ISACC) (1-6). DOI: 10.1109/ISACC56298.2023.10083518. Online publication date: 3-Feb-2023
        • (2023) UPoS Tagger for Low Resource Assamese Language: LSTM and BiLSTM based Modelling. 2023 IEEE International Conference on Machine Learning and Applied Network Technologies (ICMLANT) (1-6). DOI: 10.1109/ICMLANT59547.2023.10372865. Online publication date: 14-Dec-2023
