Abstract
The language style on social media platforms is informal, and many Internet slang words are used. The presence of such out-of-vocabulary words significantly degrades the performance of language models used for linguistic analysis. This paper presents a novel corpus of Japanese Internet slang words in context and partitions them into two major types and 10 subcategories according to their definitions. Existing word-level and character-level embedding models have achieved remarkable improvements on a variety of natural-language processing tasks but often struggle with out-of-vocabulary words such as slang. We therefore propose a joint model that combines word-level and character-level embeddings as token representations of the text. We tested our model against other language models on type/subcategory recognition. The fine-grained subcategories make it possible to analyze each model's performance in more detail according to the word formation of each Internet slang category. Our experimental results show that our joint model achieves state-of-the-art performance on Internet slang words, accurately detecting semantic changes while also locating the other major type, novel combinations of characters.
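The joint representation described in the abstract can be illustrated with a minimal sketch. Here a token's character vectors are mean-pooled and concatenated with its word vector, so an out-of-vocabulary slang token still receives a meaningful representation from its characters. The pooling choice, embedding sizes, and lookup tables below are illustrative assumptions, not the paper's actual architecture, which the abstract does not specify.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy embedding tables; a real model would learn these (sizes are hypothetical).
WORD_DIM, CHAR_DIM = 8, 4
word_table = {"草": rng.standard_normal(WORD_DIM)}            # in-vocabulary word
char_table = {c: rng.standard_normal(CHAR_DIM) for c in "草w"}  # character inventory

UNK_WORD = np.zeros(WORD_DIM)  # OOV words fall back to a zero/UNK vector


def token_representation(token: str) -> np.ndarray:
    """Concatenate the word embedding with the mean of the character embeddings."""
    w = word_table.get(token, UNK_WORD)
    chars = [char_table.get(c, np.zeros(CHAR_DIM)) for c in token]
    c = np.mean(chars, axis=0)
    return np.concatenate([w, c])


# "www" (Japanese slang for laughter) is OOV at the word level, but its
# character-level half of the vector is still informative.
vec = token_representation("www")
assert vec.shape == (WORD_DIM + CHAR_DIM,)
```

With this scheme, a purely word-level model would map "www" to a single uninformative UNK vector, while the joint representation retains a character-derived signal, which is the intuition behind combining both levels.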
Notes
- 2. "Foreign words" in Japanese.
- 6. https://dumps.wikimedia.org/jawiki/latest/ [accessed October 2020].
- 9. "c" and "w" denote character and word embeddings, respectively.
Acknowledgments
We are very grateful to Dr. Wakako Kashino at the National Institute for Japanese Language and Linguistics for her guidance and help in identifying and classifying Japanese Internet slang words. We are also grateful to the Japanese members of our research laboratory for their help in annotating and checking the dataset. This work was partially supported by a Japan Society for the Promotion of Science Grant-in-Aid for Scientific Research (B) (#19H04420).
Ethics declarations
This research was conducted with the approval of the Ethics Review Committee of the Faculty of Library, Information and Media Science, the University of Tsukuba. The participants in the corpus creation experiment were asked to sign a consent form in advance and were allowed to quit the experiment at any time.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Liu, Y., Seki, Y. (2021). Joint Model Using Character and Word Embeddings for Detecting Internet Slang Words. In: Ke, H.R., Lee, C.S., Sugiyama, K. (eds.) Towards Open and Trustworthy Digital Societies. ICADL 2021. Lecture Notes in Computer Science, vol. 13133. Springer, Cham. https://doi.org/10.1007/978-3-030-91669-5_2
DOI: https://doi.org/10.1007/978-3-030-91669-5_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-91668-8
Online ISBN: 978-3-030-91669-5
eBook Packages: Computer Science (R0)