Transfer learning based code-mixed part-of-speech tagging using character level representations for Indian languages

  • Original Research
Journal of Ambient Intelligence and Humanized Computing

Abstract

Massive amounts of unstructured content are generated every day on social media platforms such as Facebook, Twitter, and blogs. Analyzing and extracting useful information from this vast amount of text is challenging, and social media now offers researchers and practitioners extensive opportunities for research in this area. Most social media text tends to be either in English or in code-mixed regional languages. In a multilingual country like India, code-mixing is common in social media discussions. Multilingual users frequently write in Roman script, a convenient mode of expression, instead of the regional language script, and often mix English into their native languages. Stylistic and grammatical irregularities are significant challenges when processing code-mixed text with conventional methods. This paper presents a new word embedding based on character-level representations, used as features for POS tagging code-mixed text in Indian languages on the ICON-2015 and ICON-2016 NLP tools contest data sets. The proposed word embedding features are context-appended, and the well-known Support Vector Machine (SVM) classifier is used to train the system. We combined the Facebook, Twitter, and WhatsApp code-mixed data of three Indian languages to train a transfer learning based, language-independent and source-independent POS tagger. The experimental results show that the proposed transfer method achieved state-of-the-art accuracy in 12 of the 18 systems for the ICON data set.
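As a rough illustration of what "context-appended" character-level features could look like, the sketch below builds a vector for each token from its character n-grams and then concatenates it with the vectors of its neighbours. This is not the paper's method: the paper learns its embeddings, whereas this sketch substitutes simple hashed n-gram counts as a stand-in. The function names, the window size, the vector dimension, and the example sentence are all hypothetical. The resulting rows are the kind of fixed-length input an SVM classifier could be trained on, although no classifier is shown here.

```python
from typing import List


def char_ngram_vector(word: str, dim: int = 16, n: int = 3) -> List[float]:
    """Hash the character n-grams of a word into a fixed-size count vector.

    Assumption: hashed n-gram counts stand in for a learned character-level
    embedding; they are used only to keep the sketch self-contained.
    """
    padded = f"<{word.lower()}>"  # boundary markers around the word
    vec = [0.0] * dim
    for i in range(max(1, len(padded) - n + 1)):
        gram = padded[i:i + n]
        # Deterministic hash, so the vectors are reproducible across runs.
        idx = sum(ord(c) * (31 ** k) for k, c in enumerate(gram)) % dim
        vec[idx] += 1.0
    return vec


def context_appended(tokens: List[str], window: int = 1,
                     dim: int = 16) -> List[List[float]]:
    """Concatenate each token's vector with those of its neighbours.

    Positions outside the sentence are filled with zero vectors.
    """
    vectors = [char_ngram_vector(t, dim) for t in tokens]
    pad = [0.0] * dim
    features = []
    for i in range(len(tokens)):
        row: List[float] = []
        for j in range(i - window, i + window + 1):
            row.extend(vectors[j] if 0 <= j < len(tokens) else pad)
        features.append(row)
    return features


# Hypothetical Romanized code-mixed sentence, used only as sample input.
tokens = ["movie", "romba", "nalla", "irukku"]
features = context_appended(tokens, window=1, dim=16)
```

With a window of 1 and 16 dimensions per token, each row has 48 values: the previous token's vector, the current token's vector, and the next token's vector. Because the vectors come from characters rather than a fixed vocabulary, spelling variants of the same Romanized word share many n-grams and therefore get similar features.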

Notes

  1. The term "Code-Mixed Cross-Script" is referred to as "code-mixed script" throughout this paper.

  2. Although we used only the data set provided by the task organizers, we considered our task submission unconstrained because data sets of other languages and other sources were used to learn the word embedding and character embedding.

  3. Constrained means the participant team is allowed to use only the corpus given by the organizer for training. No external resources are allowed.

  4. Unconstrained means the participant team can use any external resource (available POS tagger, NER, parser, and any additional data) to train their system.

  5. Stylistic features are used in the constrained model, and Word2vec is used in the unconstrained model.

Acknowledgements

We would like to thank the ICON-2015 and ICON-2016 tools contest organizers for organizing the NLP event in India. We would also like to thank Dr. Amitav Das for initiating this research along with the tools contest.

Author information

Corresponding author

Correspondence to Anand Kumar Madasamy.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Madasamy, A.K., Padannayil, S.K. Transfer learning based code-mixed part-of-speech tagging using character level representations for Indian languages. J Ambient Intell Human Comput 14, 7207–7218 (2023). https://doi.org/10.1007/s12652-021-03573-3

  • DOI: https://doi.org/10.1007/s12652-021-03573-3
