Similarity-Based Dataset Recommendation Across Languages and Domains to Sentiment Analysis in the Electoral Domain

  • Conference paper
Electronic Participation (ePart 2022)

Abstract

Traditional machine learning classifiers usually fail to predict labels for new data whose distribution differs from that of the training data. This is particularly true of sentiment classifiers, as vocabulary and people’s opinions evolve rapidly. Naturally, the problem worsens when the target domain has few or no labeled instances. In this paper, we propose a dataset recommendation method based on multilingual embeddings and similarity metrics to choose which sentiment analysis datasets to use as a training set when labeled data is unavailable or scarce. We adopted sentiment analysis in the electoral domain as our case study, given the complexity and difficulty of manually labeling millions of political social media opinions during the short period of a campaign. Our results suggest that dataset similarity may be considered, even when datasets belong to different languages, to minimize the negative effects of domain shift in sentiment classification tasks.
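As a rough illustration of the idea in the abstract, the sketch below ranks candidate labeled datasets by their similarity to an unlabeled target dataset. The abstract does not say how embeddings are aggregated or which similarity metric is used, so two assumptions are made here: each dataset is summarized by the mean of its sentence embeddings, and datasets are compared by cosine similarity. The function names (`centroid`, `cosine_similarity`, `rank_datasets`), the dataset names, and the toy three-dimensional vectors are hypothetical; in practice the vectors would come from a multilingual sentence encoder.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def centroid(embeddings: Sequence[Vector]) -> List[float]:
    """Average a non-empty collection of equal-length sentence embeddings."""
    if not embeddings:
        raise ValueError("need at least one embedding")
    dim = len(embeddings[0])
    return [sum(vec[i] for vec in embeddings) / len(embeddings) for i in range(dim)]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_datasets(
    target: Sequence[Vector],
    candidates: Dict[str, Sequence[Vector]],
) -> List[Tuple[str, float]]:
    """Sort candidate datasets by centroid similarity to the target, best first."""
    target_centroid = centroid(target)
    scores = [
        (name, cosine_similarity(target_centroid, centroid(embs)))
        for name, embs in candidates.items()
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for encoder output (assumption: real ones are much larger).
target_embeddings = [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]]
candidate_embeddings = {
    "movie_reviews_en": [[0.0, 1.0, 0.0], [0.1, 0.9, 0.0]],
    "political_tweets_pt": [[0.9, 0.1, 0.0], [1.0, 0.0, 0.1]],
}

for name, score in rank_datasets(target_embeddings, candidate_embeddings):
    print(f"{name}: {score:.3f}")
```

With these toy vectors, the hypothetical `political_tweets_pt` dataset ranks above `movie_reviews_en`, so it would be recommended first as training data even though it is in a different language from an English target.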

Notes

  1. https://www.kaggle.com/

  2. https://www.github.com/

  3. https://datasetsearch.research.google.com/

  4. https://tfhub.dev/google/universal-sentence-encoder-multilingual/3

  5. https://github.com/sjessicasoaress/ds_recommender

  6. https://bit.ly/3FC3qC5

Acknowledgement

This research was supported by the Brazilian Research CNPq APQ Universal (Grant 421608/2018-8), CNPq Research Grant 311275/2020-6, FAPERJ Research grant E26/202.914/2019 (247109), Microsoft Research Grant and Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brazil (CAPES).

Author information

Corresponding author

Correspondence to Jéssica Soares dos Santos.

Copyright information

© 2022 IFIP International Federation for Information Processing

About this paper

Cite this paper

dos Santos, J.S., Bernardini, F., Paes, A. (2022). Similarity-Based Dataset Recommendation Across Languages and Domains to Sentiment Analysis in the Electoral Domain. In: Krimmer, R., et al. Electronic Participation. ePart 2022. Lecture Notes in Computer Science, vol 13392. Springer, Cham. https://doi.org/10.1007/978-3-031-23213-8_7

  • DOI: https://doi.org/10.1007/978-3-031-23213-8_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-23212-1

  • Online ISBN: 978-3-031-23213-8

  • eBook Packages: Computer Science, Computer Science (R0)
