Abstract
This paper presents a series of ensemble approaches for imbalanced text classification. All of the approaches use a metric learning technique to obtain better text representations for training weak classifiers. Because metric learning techniques also suffer from class imbalance, each approach handles the imbalance with an undersampling-based ensemble. Four ensemble approaches (MLBagging, MLBoosting, MLStacking, and MLBoostacking) are proposed: the first three correspond to the classic ensemble frameworks (bagging, boosting, and stacking), and the fourth combines boosting and stacking. MLBagging, MLBoosting, and MLStacking train metric learners on individual undersampled datasets and combine them, while MLBoostacking trains metric learners step by step; that is, each metric learner learns a feature transformation so that samples misclassified in the previous step become correctly classified. Experimental evaluation on three imbalanced text classification datasets (unfair statement classification in terms-of-service documents, hate speech detection in a forum, and hate speech tweet detection) shows that the proposed approaches improve classification performance over BERT-based approaches by refining text representations through metric learning.
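To make the undersampling-based ensemble idea concrete, the following is a minimal sketch of a bagging-style variant: each weak learner is trained on its own balanced undersample of the training data, and predictions are combined by majority vote. This is an illustrative simplification, not the authors' implementation; it uses logistic regression in place of a metric learner, and the helper names (`undersample`, `bagging_predict`) are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def undersample(X, y, rng):
    """Balance a binary dataset by randomly undersampling the majority class."""
    pos = np.where(y == 1)[0]
    neg = np.where(y == 0)[0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    sampled = rng.choice(majority, size=len(minority), replace=False)
    idx = np.concatenate([minority, sampled])
    return X[idx], y[idx]

def bagging_predict(X_train, y_train, X_test, n_learners=5, seed=0):
    """Train weak learners on independent undersamples; combine by majority vote."""
    rng = np.random.default_rng(seed)
    votes = np.zeros(len(X_test))
    for _ in range(n_learners):
        Xs, ys = undersample(X_train, y_train, rng)
        clf = LogisticRegression(max_iter=1000).fit(Xs, ys)
        votes += clf.predict(X_test)
    return (votes >= n_learners / 2).astype(int)
```

In the paper's approaches, the logistic regression would be replaced by a metric learner that transforms text embeddings (e.g. BERT outputs) before classification; the ensemble structure around it stays the same.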
Acknowledgement
This work was partly supported by JSPS KAKENHI JP21H0355.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Komamizu, T. (2023). Towards Ensemble-Based Imbalanced Text Classification Using Metric Learning. In: Strauss, C., Amagasa, T., Kotsis, G., Tjoa, A.M., Khalil, I. (eds) Database and Expert Systems Applications. DEXA 2023. Lecture Notes in Computer Science, vol 14147. Springer, Cham. https://doi.org/10.1007/978-3-031-39821-6_15
Print ISBN: 978-3-031-39820-9
Online ISBN: 978-3-031-39821-6