Cross-lingual sentiment classification: Similarity discovery plus training data adjustment
Introduction
As Internet access has become globally available, a new wave of social media has emerged, in which the volume of user-generated content on the web has increased massively through applications such as Facebook, Twitter, Flickr and LinkedIn, as well as commercial web sites. The value of such unbiased, real-time user-generated content has proven tremendous, with applications in areas such as marketing, decision support systems, politics and public policy. Because of this enormous volume, however, summarizing information from online user content is a difficult and challenging task. Many natural language processing and information retrieval systems have been designed to automatically process text and opinions using subjectivity and sentiment analysis [1], [2], [3], [4].
Different methods have been applied to sentiment classification tasks. These methods can be categorized into two main groups: lexicon-based and corpus-based [1], [5], [6]. Both groups draw sentiment classification information from expert-annotated data sets. However, such sentiment classification resources exist in only a limited number of languages, which leads to a resource imbalance between languages: most sentiment classification resources are written in English [7], [8]. Furthermore, the manual construction of reliable sentiment resources is a difficult and time-consuming task. Therefore, it would be advantageous to utilize labeled sentiment resources in one language (e.g., English) for sentiment classification in another language. This idea motivates an interesting research area called cross-lingual sentiment classification (CLSC). The most direct solution is to use machine translation systems to project information from one language into another [7], [8], [9], [10], [11], [12]. Most existing works in this area have applied machine translation to translate labeled training data from the source language into the target language and perform sentiment classification in the target language [13], [14]. Other researchers have employed machine translation in the opposite direction, translating unlabeled test data from the target language into the source language and performing classification in the source language [7], [11], [15]. A limited number of works have used both directions of translation to create two different views of the training and test data, compensating for some of the translation limitations [8], [12], [16], [17].
The large gap between languages occurs naturally. Every language has its own linguistic terms and writing styles. Even when expressing similar ideas, there can be great disparity in metaphor and vocabulary across languages, leading to a much smaller word and phrase intersection between translations and native expressions, as well as different semantic feature distributions between original- and target-language content. As a result, CLSC tasks cannot achieve performance comparable to that of monolingual sentiment classification. To alleviate this language gap, auxiliary unlabeled corpora or unlabeled parallel corpora are added at the training stage [8], [18] to provide more bilingual word features. This strategy extends the training set and brings the original and target languages closer in the representation space. However, the complementary data introduces noise along with useful information, because its distribution does not exactly match that of the training and test data. Especially when a distribution disparity already exists between the training set and the test set, the noise in the complementary data may do more harm than good. Additionally, obtaining the complementary data itself requires effort, which restricts the applicability of these methods.
Based on the above analysis, we attempt to overcome the difficulty of distribution disparity directly, without any auxiliary samples. This paper proposes a novel CLSC strategy of similarity discovery plus training data adjustment (SD-TDA). In the similarity discovery phase, we build an aligned-translation topic model to generate a bilingual concept representation space in which the content difference between samples from the original and target languages can be measured through their topic distributions. Because the aligned-translation topic model incorporates co-occurrence information of terms both within each language and between the original and target languages, cross-lingual sentiment relevance is substantially enhanced. Then, in the training data adjustment phase, we set up a semi-supervised process to generate a training set suited to the target test set. The generated training set is the subset of labeled original-language samples that are similar to the reference samples. The reference samples are informative unlabeled target-language samples that the semi-supervised process can classify with a high degree of confidence. The final classifier is trained on the generated training set to perform the cross-lingual sentiment classification task. When these two steps work together, the sentiment similarity between languages is maximized while the distribution gap is minimized. In general, our strategy forms a new framework for fitting the distribution disparity between the training set and the test set.
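The training data adjustment phase can be illustrated with a minimal sketch. The function names, the use of cosine similarity, and the single-pass top-k selection are illustrative assumptions, not the paper's exact procedure; the sketch assumes topic distributions in the shared concept space are already available:

```python
from math import sqrt

def cosine(p, q):
    """Cosine similarity between two topic-distribution vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = sqrt(sum(a * a for a in p)) * sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0

def adjust_training_set(source_samples, reference_vectors, k):
    """Keep the k labeled source samples most similar to the references.

    source_samples: (label, topic_vector) pairs from the original language.
    reference_vectors: topic vectors of confidently classified
    target-language reference samples.
    """
    scored = []
    for label, vec in source_samples:
        # Score each source sample by its best match among the references.
        best = max(cosine(vec, r) for r in reference_vectors)
        scored.append((best, label, vec))
    scored.sort(key=lambda t: t[0], reverse=True)
    return [(label, vec) for _, label, vec in scored[:k]]
```

On toy two-topic vectors, `adjust_training_set([("pos", [0.9, 0.1]), ("neg", [0.1, 0.9])], [[0.8, 0.2]], k=1)` retains the positive sample, whose topic distribution lies closest to the reference.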
Section snippets
Related works
Cross-lingual sentiment classification. Cross-lingual sentiment classification is a type of text classification task, first proposed by Bel et al. [19]; earlier studies focused on cross-lingual information retrieval. Traditional cross-lingual classification and information retrieval tasks usually build semantic mappings between languages based on resources such as bilingual lexicons or bilingual parallel corpora [20], [21]. Based on these
Basic framework
Cross-lingual sentiment classification aims to predict the sentiment labels of target-language samples using labeled training samples in the original language. It is a corpus-based text classification task, but with its own characteristics. Unlike traditional text classification, in cross-lingual tasks the words in the training and test samples naturally have different characteristics. During the traditional text
Aligned-translation topic model (ATTM)
In this section, we introduce how to develop a topic model from aligned-translation data. The topic model clusters relevant words into a dense concept space that describes the content of the textual data. In cross-lingual sentiment tasks, we need a concept space holding topics from both the original and target languages, in which synonymous content from different languages is clustered into related topics. The cross-lingual synonym content relations usually have to be
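Once both languages are represented in one topic space, the content difference between two samples can be measured by comparing their topic distributions. The sketch below uses the Jensen-Shannon divergence, a common symmetric choice for comparing distributions; the paper does not specify this exact measure, so treat it as an assumption:

```python
from math import log

def js_divergence(p, q):
    """Jensen-Shannon divergence between two topic distributions:
    symmetric, finite, and zero only for identical distributions."""
    def kl(a, b):
        # Kullback-Leibler divergence, skipping zero-probability topics.
        return sum(x * log(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

With natural logarithms the value is bounded by ln 2, so it can be normalized into a similarity score if needed.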
Semi-supervised training data adjustment
After the ATTM, both the original- and target-language samples are represented in the concept topic space. The ATTM clusters related words from the original and target languages into corresponding topics, enhancing the coherence of the content concepts. However, the concept divergence between the original and target languages cannot be eliminated by the similarity discovery strategy alone, since the two languages still have topics on different aspects. The language gap still exists in
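The reference-sample selection described above can be sketched as follows. A toy nearest-centroid classifier stands in for the paper's semi-supervised learner, and the margin-based confidence measure and threshold are illustrative assumptions:

```python
from math import sqrt

def cosine(p, q):
    """Cosine similarity between two topic-distribution vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = sqrt(sum(a * a for a in p)) * sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0

def select_references(labeled, unlabeled, threshold=0.3):
    """Return (predicted_label, vector) for target samples classified
    with high confidence.

    labeled: (label, topic_vector) pairs from the source language.
    unlabeled: target-language topic vectors.
    """
    # One centroid per sentiment label, averaged over its samples.
    groups = {}
    for label, vec in labeled:
        groups.setdefault(label, []).append(vec)
    centroids = {label: [sum(col) / len(vecs) for col in zip(*vecs)]
                 for label, vecs in groups.items()}

    references = []
    for vec in unlabeled:
        sims = sorted(((cosine(vec, c), lab) for lab, c in centroids.items()),
                      reverse=True)
        # Confidence = margin between the best and second-best label.
        margin = sims[0][0] - (sims[1][0] if len(sims) > 1 else 0.0)
        if margin >= threshold:
            references.append((sims[0][1], vec))
    return references
```

Samples lying near the decision boundary (small margin) are excluded, so only confidently labeled target samples guide the subsequent training data adjustment.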
Experiments and evaluations
In this section, we introduce the experiments conducted to evaluate the proposed method. The experiments included test datasets in four different languages, with training datasets in Chinese. We tested the influence of the parameters in our method and selected the best parameters to verify its effectiveness. We compared our experimental results with the co-training method, the transductive support vector machine, and the best performance on the COAE2014 tasks. The details of
Conclusions and future work
This paper has proposed an effective method for the cross-lingual sentiment classification task. We concentrate on the primary obstacles of cross-lingual sentiment classification: constructing a cross-lingual concept representation space and bridging the language gap. The proposed similarity discovery plus training data adjustment framework includes two stages that separately form a concept representation space based on cross-lingual word co-occurrence relations and overcome the
Acknowledgments
The authors would like to thank all anonymous reviewers for their valuable comments and suggestions, which have significantly improved the quality and presentation of this paper. This work was supported by the National High-Tech Research and Development Program (863 Program) (2015AA011808); the National Natural Science Foundation of China (61432011, 61573231, 61175067, 61272095, U1435212, 41401521); the Shanxi Province Returned Overseas Research Project (2013-014); the Shanxi Province
References (36)
- et al., A survey on opinion mining and sentiment analysis: tasks, approaches and applications, Knowl. Based Sys. (2015)
- et al., Enriching semantic knowledge bases for opinion mining in big data applications, Knowl. Based Sys. (2014)
- et al., Methods for cross-language plagiarism detection, Knowl. Based Sys. (2013)
- et al., Document-level sentiment classification: An empirical comparison between SVM and ANN, Expert Sys. Appl. (2013)
- et al., Sentiment polarity detection in Spanish reviews combining supervised and unsupervised approaches, Expert Sys. Appl. (2013)
- et al., Computational approaches to subjectivity and sentiment analysis: Present and envisaged methods and applications, Comput. Speech Lang. (2014)
- et al., Comparative experiments using supervised learning and machine translation for multilingual sentiment analysis, Comput. Speech Lang. (2014)
- et al., Bi-view semi-supervised active learning for cross-lingual sentiment classification, Inform. Process. Manag. (2014)
- et al., Automatic construction of domain-specific sentiment lexicon based on constrained label propagation, Knowl. Based Sys. (2014)
- et al., Choosing the best dictionary for cross-lingual word sense disambiguation, Knowl. Based Sys. (2015)