DOI: 10.1145/3539618.3591638
research-article

An Effective, Efficient, and Scalable Confidence-based Instance Selection Framework for Transformer-Based Text Classification

Published: 18 July 2023

ABSTRACT

Transformer-based deep learning is currently the state of the art in many NLP and IR tasks. However, fine-tuning such Transformers for specific tasks, especially in scenarios of ever-expanding volumes of data with constant re-training requirements and budget constraints, is costly (computationally and financially) and energy-consuming. In this paper, we focus on Instance Selection (IS), a set of methods for selecting the most representative documents for training, aimed at maintaining (or improving) classification effectiveness while reducing total training (or fine-tuning) time. We propose E2SC-IS (Effective, Efficient, and Scalable Confidence-Based IS), a two-step framework with a particular focus on Transformers and large datasets. E2SC-IS estimates the probability of each instance being removed from the training set based on scalable, fast, and calibrated weak classifiers. E2SC-IS also exploits iterative heuristics to estimate a near-optimal reduction rate. Our solution can reduce the training sets by 29% on average while maintaining the effectiveness in all datasets, with speedup gains up to 70%, scaling for very large datasets (something that the baselines cannot do).
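The abstract does not spell out the selection step, so the following is only a minimal sketch of the general idea of confidence-based instance selection, not the E2SC-IS algorithm itself. It assumes that a calibrated weak classifier has already produced, for each training instance, the probability it assigns to the true class and whether its prediction was correct. The function names `removal_weights` and `select_kept_indices`, the rule that misclassified instances are never removed, and the fixed `reduction_rate` argument (which stands in for the iteratively estimated rate) are all hypothetical choices made for illustration.

```python
import random


def removal_weights(true_class_probs, predicted_correctly):
    """Turn weak-classifier confidences into removal weights.

    Assumption (not from the abstract): instances the weak classifier
    gets right with high confidence are treated as the most redundant,
    while misclassified instances get weight 0 and are always kept.
    """
    return [p if ok else 0.0
            for p, ok in zip(true_class_probs, predicted_correctly)]


def select_kept_indices(weights, reduction_rate, seed=0):
    """Return the indices kept after dropping about `reduction_rate`
    of the instances, sampling removals without replacement with
    probability proportional to `weights`."""
    rng = random.Random(seed)
    candidates = [i for i, w in enumerate(weights) if w > 0]
    n_remove = min(len(candidates), round(reduction_rate * len(weights)))
    # Weighted sampling without replacement via random keys u ** (1 / w):
    # a larger weight makes a large key, and hence removal, more likely.
    keyed = sorted(candidates,
                   key=lambda i: rng.random() ** (1.0 / weights[i]),
                   reverse=True)
    removed = set(keyed[:n_remove])
    return [i for i in range(len(weights)) if i not in removed]


if __name__ == "__main__":
    # Toy confidences; in practice these would come from a calibrated
    # weak classifier evaluated on held-out folds of the training set.
    probs = [0.99, 0.95, 0.60, 0.55, 0.90]
    correct = [True, True, False, True, True]
    weights = removal_weights(probs, correct)
    print(select_kept_indices(weights, reduction_rate=0.4, seed=1))
```

Sampling removals, rather than cutting at a hard confidence threshold, leaves some high-confidence instances in the reduced set. Whether that trade-off matches the framework described in the paper cannot be confirmed from the abstract alone.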


      Published in

      SIGIR '23: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval
      July 2023
      3567 pages
      ISBN: 9781450394086
      DOI: 10.1145/3539618

      Copyright © 2023 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Acceptance Rates

      Overall Acceptance Rate: 792 of 3,983 submissions, 20%
