ABSTRACT
Transformer-based deep learning is currently the state of the art in many NLP and IR tasks. However, fine-tuning such Transformers for specific tasks, especially in scenarios of ever-expanding volumes of data with constant re-training requirements and budget constraints, is costly (computationally and financially) and energy-consuming. In this paper, we focus on Instance Selection (IS), a set of methods for selecting the most representative documents for training, aimed at maintaining (or improving) classification effectiveness while reducing total training (or fine-tuning) time. We propose E2SC-IS -- Effective, Efficient, and Scalable Confidence-Based IS -- a two-step framework with a particular focus on Transformers and large datasets. E2SC-IS estimates the probability of each instance being removed from the training set based on scalable, fast, and calibrated weak classifiers. E2SC-IS also exploits iterative heuristics to estimate a near-optimal reduction rate. Our solution reduces the training sets by 29% on average while maintaining effectiveness on all datasets, with speedups of up to 70%, and scales to very large datasets (something the baselines cannot do).
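The first step of the framework described above — scoring each training instance's removal probability with a fast, calibrated weak classifier — can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: logistic regression stands in for the paper's weak classifier, and the rule that removal probability is proportional to the classifier's confidence in the true class is an assumption made here for concreteness.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict


def confidence_based_selection(X, y, reduction_rate=0.29, seed=0):
    """Sketch of confidence-based instance selection.

    Fits a fast, calibrated weak classifier and preferentially removes
    the instances it is most confident about (assumed redundant),
    keeping the rest for the expensive Transformer fine-tuning step.
    Returns the indices of the instances to keep.
    """
    weak = CalibratedClassifierCV(LogisticRegression(max_iter=1000), cv=3)
    # Out-of-fold probabilities avoid overconfident in-sample scores.
    proba = cross_val_predict(weak, X, y, cv=3, method="predict_proba")
    confidence = proba[np.arange(len(y)), y]  # probability of the true class
    # Removal probability proportional to confidence: the instances the
    # weak model already handles easily carry the least new information.
    p = confidence / confidence.sum()
    n_remove = int(reduction_rate * len(y))
    rng = np.random.default_rng(seed)
    removed = rng.choice(len(y), size=n_remove, replace=False, p=p)
    return np.setdiff1d(np.arange(len(y)), removed)


# Synthetic stand-in for a text classification training set.
X, y = make_classification(n_samples=300, n_classes=3, n_informative=6,
                           random_state=42)
keep_idx = confidence_based_selection(X, y, reduction_rate=0.29)
```

The second step of the paper, iteratively searching for a near-optimal `reduction_rate`, is not shown; here the 29% average reduction reported in the abstract is simply passed in as a fixed parameter.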