Abstract
Traditional machine learning classifiers often fail to predict labels for new data whose distribution differs from that of the training data. This is particularly true of sentiment classifiers, as vocabulary and people's opinions evolve rapidly. Naturally, the problem worsens when the target domain has few or no labeled instances. In this paper, we propose a dataset recommendation method based on multilingual embeddings and similarity metrics to select the sentiment analysis datasets to use as training sets when labeled data is unavailable or scarce. We adopted sentiment analysis in the electoral domain as our case study, given the complexity and difficulty of manually labeling millions of political social media opinions during the short period of a campaign. Our results suggest that dataset similarity may be considered, even when the datasets are in different languages, to minimize the negative effects that domain shift may cause in sentiment classification tasks.
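To make the idea of similarity-based dataset recommendation concrete, the following is a minimal sketch and not the authors' implementation. It assumes each dataset has already been encoded into sentence vectors by some multilingual encoder, summarizes each dataset by the centroid of its vectors, and ranks candidate source datasets by cosine similarity to the target. The centroid summary, the choice of cosine similarity, the function names, the dataset names, and the toy 3-dimensional vectors are all assumptions made for illustration; the abstract does not specify which similarity metrics are used.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def centroid(embeddings: List[Vector]) -> List[float]:
    """Average a dataset's sentence embeddings into a single vector."""
    n = len(embeddings)
    dim = len(embeddings[0])
    return [sum(vec[i] for vec in embeddings) / n for i in range(dim)]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_source_datasets(
    target: List[Vector], sources: Dict[str, List[Vector]]
) -> List[Tuple[str, float]]:
    """Rank labeled source datasets by similarity to the target dataset."""
    target_centroid = centroid(target)
    scores = [
        (name, cosine_similarity(centroid(embeddings), target_centroid))
        for name, embeddings in sources.items()
    ]
    # Most similar dataset first.
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for multilingual sentence embeddings (hypothetical).
target_embeddings = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]
source_embeddings = {
    "politics_pt": [[0.85, 0.15, 0.05], [0.9, 0.1, 0.1]],
    "movie_reviews_en": [[0.1, 0.9, 0.2], [0.0, 0.8, 0.3]],
}

ranking = rank_source_datasets(target_embeddings, source_embeddings)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Under these assumptions, the top-ranked dataset would be the first candidate to use as training data for the unlabeled target. Because the vectors come from a shared multilingual space, the same comparison applies when source and target are in different languages.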
Acknowledgement
This research was supported by the Brazilian Research CNPq APQ Universal (Grant 421608/2018-8), CNPq Research Grant 311275/2020-6, FAPERJ Research grant E26/202.914/2019 (247109), Microsoft Research Grant and Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brazil (CAPES).
Copyright information
© 2022 IFIP International Federation for Information Processing
About this paper
Cite this paper
dos Santos, J.S., Bernardini, F., Paes, A. (2022). Similarity-Based Dataset Recommendation Across Languages and Domains to Sentiment Analysis in the Electoral Domain. In: Krimmer, R., et al. Electronic Participation. ePart 2022. Lecture Notes in Computer Science, vol 13392. Springer, Cham. https://doi.org/10.1007/978-3-031-23213-8_7
DOI: https://doi.org/10.1007/978-3-031-23213-8_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-23212-1
Online ISBN: 978-3-031-23213-8
eBook Packages: Computer Science, Computer Science (R0)