Elsevier

Knowledge-Based Systems

Volume 107, 1 September 2016, Pages 129-141

Cross-lingual sentiment classification: Similarity discovery plus training data adjustment

https://doi.org/10.1016/j.knosys.2016.06.004

Abstract

The performance of cross-lingual sentiment classification is sharply limited by the language gap, which means that each language has its own ways to express sentiments. Many methods have been designed to transmit sentiment information across languages by making use of machine translation, parallel corpora, auxiliary unlabeled samples and other resources. In this paper, a new approach is proposed based on the selection of training data, where labeled samples highly similar to the target language are put into the training set. The refined training samples are used to build up an effective cross-lingual sentiment classifier focusing on the target language. The proposed approach contains two major strategies: the aligned-translation topic model and the semi-supervised training data adjustment. The aligned-translation topic model provides a cross-language representation space in which the semi-supervised training data adjustment procedure attempts to select effective training samples to eliminate the negative influence of the semantic distribution differences between the original and target languages. The experiments show that the proposed approach is feasible for cross-language sentiment classification tasks and provides insight into the semantic relationship between two different languages.

Introduction

As Internet access has become convenient worldwide, a new wave of social media has emerged, and the volume of user-generated content on the web has massively increased through applications such as Facebook, Twitter, Flickr and LinkedIn, as well as commercial web sites. The value of such unbiased, real-time user-generated content has proven tremendous, with applications in areas such as marketing, decision support systems, politics and public policy. Because of the enormous amount of user content, summarizing information from it is a difficult and challenging task. Many natural language processing and information retrieval systems have been designed to automatically process text and opinions using subjectivity and sentiment analysis [1], [2], [3], [4].

Different methods have been applied to sentiment classification tasks. These methods can be categorized into two main groups: lexicon-based and corpus-based [1], [5], [6]. Both draw sentiment classification information from expert-annotated data sets. However, such sentiment classification resources exist for only a limited number of languages, which leads to a resource imbalance between languages; most are written in English [7], [8]. Furthermore, the manual construction of reliable sentiment resources is a difficult and time-consuming task. Therefore, it would be advantageous to utilize labeled sentiment resources in one language (e.g., English) for sentiment classification in another language. This idea motivates an interesting research area called cross-lingual sentiment classification (CLSC). The most direct solution to this problem is to use machine translation systems to project information from one language into another [7], [8], [9], [10], [11], [12]. Most existing works in this area apply machine translation to translate labeled training data from the source language into the target language and perform sentiment classification in the target language [13], [14]. Other researchers have employed machine translation in the opposite direction, translating unlabeled test data from the target language into the source language and performing the classification in the source language [7], [11], [15]. A limited number of works have used both directions of translation to create two different views of the training and test data and thereby compensate for some of the translation limitations [8], [12], [16], [17].

The large gap between different languages arises naturally. Every language has its unique linguistic terms and writing styles. Even when expressing similar ideas, content in different languages can differ greatly in metaphor and vocabulary, leading to a much smaller word and phrase intersection between translations and native expressions, as well as different semantic feature distributions between original-language and target-language content. As a result, CLSC tasks cannot achieve performance comparable to that of monolingual sentiment classification tasks. To alleviate this language gap, auxiliary unlabeled corpora or unlabeled parallel corpora can be added to the training stage [8], [18] to provide more bilingual word features. This strategy extends the training set and brings the original and target languages closer in the representation space. However, the complementary data brings in noise along with useful information, because it does not have exactly the same distribution as the training and test data. Especially when there is already a distribution disparity between the training set and the test set, the noise in the complementary data may cause more harm than benefit. Additionally, the complementary data itself requires effort to obtain, which restricts the applicability of these methods.

Based on the above analysis, we try to overcome the difficulty of distribution disparity directly, without any auxiliary samples. This paper proposes a novel CLSC strategy of similarity discovery plus training data adjustment (SD-TDA). In the similarity discovery phase, we set up an aligned-translation topic model to generate a bilingual concept representation space in which the difference in content between samples from the original and target languages can be measured through their topic distributions. Because the aligned-translation topic model takes in co-occurrence information of terms both within one language and between the original and target languages, the relevance of cross-lingual sentiment can be sharply enhanced. Then, in the training data adjustment phase, we set up a semi-supervised process to generate a training set suited to the target test set. The generated training set is the subset of labeled original-language samples that are similar to the reference samples, where the reference samples are informative unlabeled target-language samples that the semi-supervised process can classify with high confidence. The final classifier is trained on the generated training set to fulfill the cross-lingual sentiment classification task. When these two steps work together, the sentiment similarity between languages is maximized while the distribution gap is minimized. Overall, our strategy forms a new framework to fit the distribution disparity between the training set and the test set.

Section snippets

Related works

Cross-lingual sentiment classification. Cross-lingual sentiment classification is a type of text classification task. Bel et al. [19] first proposed the cross-lingual sentiment classification task, while earlier studies focused on cross-lingual information retrieval. Traditional cross-lingual classification and information retrieval tasks usually build up semantic mapping between languages based on resources such as bilingual lexicons or bilingual parallel corpora [20], [21]. Based on these

Basic framework

Cross-lingual sentiment classification aims to predict sentiment labels of target-language samples using labeled training samples in the original language. It is a type of corpus-based text classification task, but with its own characteristics. Unlike traditional text classification tasks, in cross-lingual tasks the words in the training samples and the test samples naturally differ. During the traditional text

Aligned-translation topic model (ATTM)

In this section, we introduce how to develop a topic model from aligned-translation data. The topic model can cluster relevant words to form a dense concept space that describes the content of the textual data. In cross-lingual sentiment tasks, we need a concept space holding topics from both the original and target languages: synonymous content from different languages needs to be clustered into related topics. The cross-lingual synonym content relations usually have to be
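Although this snippet is truncated, the aligned-translation idea can be illustrated with a minimal sketch. One common way to obtain a shared bilingual topic space (not necessarily the paper's exact ATTM inference) is to merge each original-language document with its translation into a single pseudo-document, so that a standard topic model learns topics that mix terms from both languages. All data below are hypothetical toy examples, with romanized placeholders standing in for target-language terms, and sklearn's LDA stands in for the paper's model:

```python
# Illustrative sketch only: merge each document with its translation into one
# pseudo-document, so a standard topic model (here sklearn's LDA, not the
# paper's ATTM) learns topics mixing terms of both languages.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy aligned-translation pairs: (original document, its translation).
pairs = [
    ("camera great picture quality", "xiangji huamian zhiliang hao"),
    ("battery died too fast awful", "dianchi hao dian kuai zaogao"),
    ("screen bright lovely colours", "pingmu mingliang secai hao"),
]

# Merge each pair into one bilingual pseudo-document.
pseudo_docs = [src + " " + tgt for src, tgt in pairs]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(pseudo_docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(counts)

# Any monolingual document can now be projected into the shared topic space.
theta = lda.transform(vectorizer.transform(["dianchi kuai zaogao"]))
print(theta.shape)  # (1, 2): a topic distribution over the bilingual space
```

The key property this buys is that a document in either language maps to the same topic space, which is what the training data adjustment step relies on when comparing source and target samples.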

Semi-supervised training data adjustment

After the ATTM, both the original- and target-language samples are represented in the concept topic space. The ATTM clusters related words from the original and target languages into corresponding topics, enhancing the coherence of the content concepts. However, the concept divergence between the original and target languages cannot be eliminated by the similarity discovery strategy alone, for there are still topics on different aspects for the different languages. The language gap still exists in
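As a rough illustration of this kind of adjustment (a hedged sketch, not the paper's exact procedure), one can classify unlabeled target samples in the shared topic space, keep the most confidently classified ones as reference samples, and retain only the labeled original-language samples whose topic distributions are close to some reference. The toy data, logistic-regression base classifier, and thresholds below are all assumptions:

```python
# Hedged sketch of semi-supervised training data adjustment in a shared
# topic space; illustrates the general idea, not the paper's algorithm.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics.pairwise import cosine_similarity

# Toy topic distributions (rows sum to 1) for labeled original-language
# samples and unlabeled target-language samples.
X_src = np.array([[0.8, 0.1, 0.1], [0.7, 0.2, 0.1], [0.6, 0.3, 0.1],
                  [0.1, 0.1, 0.8], [0.1, 0.2, 0.7], [0.2, 0.2, 0.6]])
y_src = np.array([1, 1, 1, 0, 0, 0])            # sentiment labels
X_tgt = np.array([[0.75, 0.15, 0.10],           # clearly positive-like
                  [0.10, 0.15, 0.75],           # clearly negative-like
                  [0.34, 0.33, 0.33]])          # ambiguous

# Step 1: initial classifier trained on all labeled source samples.
clf = LogisticRegression().fit(X_src, y_src)

# Step 2: keep the most confidently classified target samples as references.
confidence = clf.predict_proba(X_tgt).max(axis=1)
refs = X_tgt[np.argsort(confidence)[-2:]]       # top-2 confident targets

# Step 3: retain source samples similar to at least one reference.
sim = cosine_similarity(X_src, refs)            # shape (n_src, n_refs)
keep = sim.max(axis=1) > 0.9
X_adj, y_adj = X_src[keep], y_src[keep]

# Step 4: retrain the final cross-lingual classifier on the refined set.
final_clf = LogisticRegression().fit(X_adj, y_adj)
```

The similarity threshold and the number of reference samples are the knobs that trade off training-set size against closeness to the target distribution; the paper's semi-supervised procedure makes this selection iteratively rather than in one pass.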

Experiments and evaluations

In this section, we introduce the experiments to evaluate the proposed method. The experiments included test datasets in four different languages, and the training datasets were in Chinese. We tested the influence of the parameters in our method, and selected the best parameters to verify the effectiveness of the proposed method. We compared our experimental results with the co-training method, transductive support vector machine, and the best performance of the COAE2014 tasks. The details of

Conclusions and future work

This paper has proposed an effective method to fulfill the cross-lingual sentiment classification task. We concentrate on the primary obstacles of cross-lingual sentiment classification: constructing a cross-lingual concept representation space and bridging the language gap. The proposed similarity discovery plus training data adjustment framework includes two stages that separately form a concept representation space based on cross-lingual word co-occurrence relations and overcome the

Acknowledgments

The authors would like to thank all anonymous reviewers for their valuable comments and suggestions, which have significantly improved the quality and presentation of this paper. This work was supported by the National High-Tech Research and Development Program (863 Program) (2015AA011808); the National Natural Science Foundation of China (61432011, 61573231, 61175067, 61272095, U1435212, 41401521); the Shanxi Province Returned Overseas Research Project (2013-014); the Shanxi Province

References (36)

  • F. Xianghua et al., Multi-aspect sentiment analysis for Chinese online social reviews based on topic modeling and HowNet lexicon, Knowl.-Based Syst. (2013)
  • X. Fu et al., Dynamic non-parametric joint sentiment topic mixture model, Knowl.-Based Syst. (2015)
  • M.S. Hajmohammadi et al., Combination of active learning and self-training for cross-lingual sentiment classification with density analysis of unlabelled samples, Inform. Sci. (2015)
  • Y. Zhang et al., Semi-supervised learning combining co-training with active learning, Expert Syst. Appl. (2014)
  • B. Pang et al., Opinion mining and sentiment analysis, Found. Trends Inf. Retr. (2008)
  • M. Taboada et al., Lexicon-based methods for sentiment analysis, Comput. Linguist. (2011)
  • X. Wan, Bilingual co-training for sentiment classification of Chinese product reviews, Comput. Linguist. (2011)
  • C. Banea et al., Multilingual subjectivity analysis using machine translation, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing (2008)