ABSTRACT
Transformer-based deep learning is currently the state of the art in many NLP and IR tasks. However, fine-tuning such Transformers for specific tasks, especially in scenarios of ever-expanding volumes of data with constant re-training requirements and budget constraints, is costly (computationally and financially) and energy-consuming. In this paper, we focus on Instance Selection (IS), a set of methods for selecting the most representative documents for training, aimed at maintaining (or improving) classification effectiveness while reducing total training (or fine-tuning) time. We propose E2SC-IS -- Effective, Efficient, and Scalable Confidence-Based IS -- a two-step framework with a particular focus on Transformers and large datasets. E2SC-IS estimates the probability of each instance being removed from the training set based on scalable, fast, and calibrated weak classifiers. E2SC-IS also exploits iterative heuristics to estimate a near-optimal reduction rate. Our solution reduces the training sets by 29% on average while maintaining effectiveness on all datasets, with speedups of up to 70%, and scales to very large datasets (something the baselines cannot do).
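The first step of the framework described above — scoring each training instance's removal probability with a fast, calibrated weak classifier — can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: logistic regression stands in for the paper's weak classifier, and the rule that removal probability is proportional to the classifier's confidence in the true class is an assumption made here for concreteness.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict


def confidence_based_selection(X, y, reduction_rate=0.29, seed=0):
    """Sketch of confidence-based instance selection.

    Fits a fast, calibrated weak classifier and preferentially removes
    the instances it is most confident about (assumed redundant),
    keeping the rest for the expensive Transformer fine-tuning step.
    Returns the indices of the instances to keep.
    """
    weak = CalibratedClassifierCV(LogisticRegression(max_iter=1000), cv=3)
    # Out-of-fold probabilities avoid overconfident in-sample scores.
    proba = cross_val_predict(weak, X, y, cv=3, method="predict_proba")
    confidence = proba[np.arange(len(y)), y]  # probability of the true class
    # Removal probability proportional to confidence: the instances the
    # weak model already handles easily carry the least new information.
    p = confidence / confidence.sum()
    n_remove = int(reduction_rate * len(y))
    rng = np.random.default_rng(seed)
    removed = rng.choice(len(y), size=n_remove, replace=False, p=p)
    return np.setdiff1d(np.arange(len(y)), removed)


# Synthetic stand-in for a text classification training set.
X, y = make_classification(n_samples=300, n_classes=3, n_informative=6,
                           random_state=42)
keep_idx = confidence_based_selection(X, y, reduction_rate=0.29)
```

The second step of the paper, iteratively searching for a near-optimal `reduction_rate`, is not shown; here the 29% average reduction reported in the abstract is simply passed in as a fixed parameter.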