
Towards Ensemble-Based Imbalanced Text Classification Using Metric Learning

  • Conference paper
Database and Expert Systems Applications (DEXA 2023)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14147)


Abstract

This paper reports a series of ensemble approaches for imbalanced text classification. All the approaches utilize a metric learning technique to obtain better text representations for training weak classifiers. Each approach handles the class imbalance problem with an undersampling-based ensemble, because metric learning techniques also suffer from class imbalance. Four ensemble approaches (namely, MLBagging, MLBoosting, MLStacking, and MLBoostacking) are proposed; three correspond to standard ensemble frameworks (bagging, boosting, and stacking), and the fourth combines boosting and stacking. MLBagging, MLBoosting, and MLStacking train metric learners on individual undersampled datasets and combine them, while MLBoostacking trains metric learners in a step-by-step manner; that is, each metric learner learns a feature transformation under which samples misclassified in the previous step become correctly classifiable. Experimental evaluation on three imbalanced text classification datasets (namely, unfair-statement classification in terms of service, hate speech detection in a forum, and hate speech detection in tweets) shows that the proposed approaches lift classification performance over BERT-based approaches by improving text representations through metric learning.
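To make the undersampling-based bagging idea concrete, here is a minimal, hypothetical sketch in the spirit of MLBagging: the majority class is randomly undersampled several times, a weak learner is trained on each balanced subset, and predictions are combined by majority vote. The names (`CentroidClassifier`, `ml_bagging`, `predict_vote`) and the synthetic data are this sketch's own inventions, and the toy nearest-centroid classifier only stands in for the paper's actual metric learners trained over BERT representations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic imbalanced two-class data: a hypothetical stand-in for text
# embeddings (200 majority-class points vs. 20 minority-class points).
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 8)),
               rng.normal(3.0, 1.0, size=(20, 8))])
y = np.array([0] * 200 + [1] * 20)

def undersample(X, y, rng):
    """Randomly undersample the majority class down to the minority size."""
    maj, mino = np.flatnonzero(y == 0), np.flatnonzero(y == 1)
    keep = rng.choice(maj, size=len(mino), replace=False)
    idx = np.concatenate([keep, mino])
    return X[idx], y[idx]

class CentroidClassifier:
    """Toy stand-in for a metric-learning-based weak classifier:
    predicts the class whose centroid is nearest in feature space."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.stack([X[y == c].mean(axis=0)
                                    for c in self.classes_])
        return self

    def predict(self, X):
        # Euclidean distance from every sample to every class centroid.
        d = np.linalg.norm(X[:, None, :] - self.centroids_[None, :, :], axis=2)
        return self.classes_[d.argmin(axis=1)]

def ml_bagging(X, y, n_learners=5, seed=0):
    """Train one weak learner per independently undersampled subset."""
    rng = np.random.default_rng(seed)
    return [CentroidClassifier().fit(*undersample(X, y, rng))
            for _ in range(n_learners)]

def predict_vote(learners, X):
    """Combine binary (0/1) predictions by majority vote."""
    votes = np.stack([m.predict(X) for m in learners])
    return (votes.mean(axis=0) >= 0.5).astype(int)

ensemble = ml_bagging(X, y)
pred = predict_vote(ensemble, X)
print("training accuracy:", (pred == y).mean())
```

Because every learner sees a balanced subset, none of them can trivially ignore the minority class, which is the core reason the paper pairs undersampling with ensembling rather than training a single model on the skewed data.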


Notes

  1. http://claudette.eui.eu/ToS.zip

  2. https://huggingface.co/datasets/hate_speech18

  3. https://huggingface.co/datasets/tweets_hate_speech_detection

  4. https://huggingface.co/bert-large-uncased

  5. https://huggingface.co/nlpaueb/legal-bert-base-uncased

  6. https://huggingface.co/Narrativaai/deberta-v3-small-finetuned-hate_speech18

  7. https://huggingface.co/mrm8488/distilroberta-finetuned-tweets-hate-speech


Acknowledgement

This work was partly supported by JSPS KAKENHI JP21H0355.

Author information

Corresponding author

Correspondence to Takahiro Komamizu.



Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Komamizu, T. (2023). Towards Ensemble-Based Imbalanced Text Classification Using Metric Learning. In: Strauss, C., Amagasa, T., Kotsis, G., Tjoa, A.M., Khalil, I. (eds) Database and Expert Systems Applications. DEXA 2023. Lecture Notes in Computer Science, vol 14147. Springer, Cham. https://doi.org/10.1007/978-3-031-39821-6_15


  • DOI: https://doi.org/10.1007/978-3-031-39821-6_15


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-39820-9

  • Online ISBN: 978-3-031-39821-6

  • eBook Packages: Computer Science, Computer Science (R0)
