Multilingual emoji prediction using BERT for sentiment analysis

Toshiki Tomihira (College of Knowledge and Library Sciences, University of Tsukuba, Tsukuba, Japan)
Atsushi Otsuka (Faculty of Library, Information and Media Science, University of Tsukuba, Tsukuba, Japan)
Akihiro Yamashita (Department of Computer Science, Tokyo National College of Technology, Tokyo, Japan)
Tetsuji Satoh (Faculty of Library, Information and Media Studies, University of Tsukuba, Tsukuba, Japan)

International Journal of Web Information Systems

ISSN: 1744-0084

Article publication date: 24 September 2020

Issue publication date: 8 October 2020

Abstract

Purpose

Recently, with the spread of social networking services, the use of emojis, which are standardized as Unicode characters, has become common. Emojis are highly effective at expressing emotions in sentences. Sentiment analysis in natural language processing typically requires emotion labels to be assigned to sentences manually; by treating the emoji in text posted on social media as a label, sentiment can be predicted without manual labeling. The purpose of this paper is to propose a new model that learns from sentences using emojis as labels, collecting English and Japanese tweets from Twitter as the corpus. The authors verify and compare multiple models based on attention long short-term memory (LSTM), convolutional neural networks (CNN) and Bidirectional Encoder Representations from Transformers (BERT).

Design/methodology/approach

The authors collected tweets containing any of the 2,661 emojis registered as Unicode characters, using the Twitter application programming interface (API), for a total of 6,149,410 Japanese tweets. First, the authors visualized the vector space produced by embedding the emojis with Word2Vec; they found that emojis and words with similar meanings are adjacent, verifying that emojis can be used for sentiment analysis. Second, each tweet containing an emoji was entered as input, with that emoji serving as the label for training and testing. The authors compared the BERT model with the conventional models [CNN, FastText and attention bidirectional long short-term memory (BiLSTM)] that scored highly in a previous study.
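The labeling step described above, in which the emoji found in a tweet becomes the supervision label, can be sketched as follows. The emoji ranges and the helper name `tweet_to_example` are illustrative assumptions for this sketch, not the authors' actual code (the paper covers all 2,661 Unicode emojis).

```python
import re

# A few illustrative Unicode emoji blocks; the paper's full set of
# 2,661 registered emojis is much larger (assumption for the sketch).
EMOJI_PATTERN = re.compile(
    "[\U0001F300-\U0001F5FF"   # symbols & pictographs
    "\U0001F600-\U0001F64F"    # emoticons
    "\U0001F680-\U0001F6FF]"   # transport & map symbols
)

def tweet_to_example(tweet: str):
    """Turn a tweet into a (text, label) pair: the emoji found in the
    tweet is used as the label and removed from the input text."""
    match = EMOJI_PATTERN.search(tweet)
    if match is None:
        return None  # tweets without emojis yield no training example
    label = match.group()
    text = EMOJI_PATTERN.sub("", tweet).strip()
    return text, label

print(tweet_to_example("great day at the beach \U0001F600"))
```

A model is then trained to predict the removed emoji from the remaining text alone.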

Findings

By visualizing the Word2Vec vector space, the authors found that emojis and words with similar meanings are adjacent, verifying that emojis can be used for sentiment analysis. The BERT models obtained higher scores than the conventional models, and the experiments demonstrate an improvement over the conventional models in both languages. Emoji prediction is strongly influenced by context, and scores can be lowered by misunderstandings of meaning. By using BERT, which is based on a bidirectional transformer, the authors can take context into account.
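The adjacency finding can be illustrated with a cosine-similarity check in an embedding space. The toy three-dimensional vectors below are invented for illustration; they are not the authors' learned Word2Vec embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Toy embeddings (assumed): if an emoji's vector lies closer to words
# of similar meaning than to unrelated words, the emoji carries usable
# sentiment information.
vectors = {
    "\U0001F602": [0.9, 0.1, 0.0],   # face with tears of joy
    "funny":      [0.8, 0.2, 0.1],
    "rain":       [0.0, 0.1, 0.9],
}

sim_funny = cosine(vectors["\U0001F602"], vectors["funny"])
sim_rain = cosine(vectors["\U0001F602"], vectors["rain"])
print(sim_funny > sim_rain)  # the emoji sits nearer its similar-meaning word
```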

Practical implications

When typing a word with an input method editor (IME), emojis can appear among the candidate outputs. Current IMEs consider only the most recently entered word, whereas the approach in this study makes it possible to recommend emojis that take the context of the whole input sentence into account. Therefore, this research can be used to improve IME performance in the future.
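Once a context-aware model has scored the emoji candidates for a sentence, the IME-facing step reduces to ranking. The function name and the probability values below are hypothetical, for illustration only.

```python
def recommend_emojis(probs: dict, k: int = 3):
    """Rank candidate emojis by model probability and return the top k,
    as an IME could do with scores from a context-aware model."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    return [emoji for emoji, _ in ranked[:k]]

# Hypothetical per-emoji probabilities for one input sentence.
scores = {"\U0001F602": 0.55, "\U0001F44D": 0.30, "\U0001F622": 0.15}
print(recommend_emojis(scores, k=2))
```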

Originality/value

In the paper, the authors focus on multilingual emoji prediction. This is the first attempt to compare emoji prediction between Japanese and English. It is also the first attempt to use the transformer-based BERT model for predicting a limited set of emojis, although the transformer is known to be effective for various natural language processing (NLP) tasks. The authors found that a bidirectional transformer is suitable for emoji prediction.

Acknowledgements

This work was supported by JSPS KAKENHI Grant Number JP16H02904.

Citation

Tomihira, T., Otsuka, A., Yamashita, A. and Satoh, T. (2020), "Multilingual emoji prediction using BERT for sentiment analysis", International Journal of Web Information Systems, Vol. 16 No. 3, pp. 265-280. https://doi.org/10.1108/IJWIS-09-2019-0042

Publisher

Emerald Publishing Limited

Copyright © 2020, Emerald Publishing Limited