Abstract
The language style on social media platforms is informal, and many Internet slang words are used. The presence of such out-of-vocabulary words significantly degrades the performance of language models used for linguistic analysis. This paper presents a novel corpus of Japanese Internet slang words in context and partitions them into two major types and 10 subcategories according to their definitions. Existing word-level and character-level embedding models have achieved remarkable improvements on a variety of natural-language processing tasks but often struggle with out-of-vocabulary words such as slang. We therefore propose a joint model that combines word-level and character-level embeddings as token representations of the text. We tested our model against other language models on type/subcategory recognition. The fine-grained subcategories make it possible to analyze each model's performance in more detail according to the word formation of each Internet slang category. Our experimental results show that our joint model achieves state-of-the-art performance on Internet slang words, accurately detecting semantic changes while also locating the other major type, novel combinations of characters.
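The joint representation described in the abstract can be illustrated with a minimal sketch. Here a token's character vectors are mean-pooled and concatenated with its word vector, so an out-of-vocabulary slang token still receives a meaningful representation from its characters. The pooling choice, embedding sizes, and lookup tables below are illustrative assumptions, not the paper's actual architecture, which the abstract does not specify.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy embedding tables; a real model would learn these (sizes are hypothetical).
WORD_DIM, CHAR_DIM = 8, 4
word_table = {"草": rng.standard_normal(WORD_DIM)}            # in-vocabulary word
char_table = {c: rng.standard_normal(CHAR_DIM) for c in "草w"}  # character inventory

UNK_WORD = np.zeros(WORD_DIM)  # OOV words fall back to a zero/UNK vector


def token_representation(token: str) -> np.ndarray:
    """Concatenate the word embedding with the mean of the character embeddings."""
    w = word_table.get(token, UNK_WORD)
    chars = [char_table.get(c, np.zeros(CHAR_DIM)) for c in token]
    c = np.mean(chars, axis=0)
    return np.concatenate([w, c])


# "www" (Japanese slang for laughter) is OOV at the word level, but its
# character-level half of the vector is still informative.
vec = token_representation("www")
assert vec.shape == (WORD_DIM + CHAR_DIM,)
```

With this scheme, a purely word-level model would map "www" to a single uninformative UNK vector, while the joint representation retains a character-derived signal, which is the intuition behind combining both levels.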
Notes
- 2. "Foreign words" in Japanese.
- 6. https://dumps.wikimedia.org/jawiki/latest/ [accessed October 2020].
- 9. "c" and "w" denote character and word embeddings, respectively.
Acknowledgments
We are very grateful to Dr. Wakako Kashino at the National Institute for Japanese Language and Linguistics for her guidance and help in identifying and classifying Japanese Internet slang words. We are also grateful to the Japanese members of our research laboratory for their help in annotating and checking the dataset. This work was partially supported by a Japan Society for the Promotion of Science Grant-in-Aid for Scientific Research (B) (#19H04420).
Ethics declarations
This research was conducted with the approval of the Ethics Review Committee of the Faculty of Library, Information and Media Science, the University of Tsukuba. The participants in the corpus creation experiment were asked to sign a consent form in advance and were allowed to quit the experiment at any time.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Liu, Y., Seki, Y. (2021). Joint Model Using Character and Word Embeddings for Detecting Internet Slang Words. In: Ke, H.R., Lee, C.S., Sugiyama, K. (eds.) Towards Open and Trustworthy Digital Societies. ICADL 2021. Lecture Notes in Computer Science, vol. 13133. Springer, Cham. https://doi.org/10.1007/978-3-030-91669-5_2
DOI: https://doi.org/10.1007/978-3-030-91669-5_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-91668-8
Online ISBN: 978-3-030-91669-5
eBook Packages: Computer Science (R0)