Poster
DOI: 10.1145/3644815.3644978

Data Selection Driven by Item Difficulty: On Investigating Data Efficient Practice for Hyperparameter Search

Published: 11 June 2024

Abstract

Foundation models shift interest toward adapting existing models rather than building proprietary models from scratch. Despite this shift, hyperparameter optimization (HPO) is still required. Users adapting systems powered by these models on proprietary data should not substantially increase the overall resource footprint through extensive hyperparameter search. Since this footprint is also proportional to the amount of data used in HPO, we investigate how a user can effectively reduce that amount, leveraging the deep learning model's empirically observed ability to produce the expected correct output for an item in the dataset.
In this work, we describe a methodology that achieves this data reduction by estimating a measure of each item's difficulty. The method keeps only a portion of the data that preserves the overall proportions of item difficulty across the dataset, while also ordering items meaningfully. The rationale derives from curriculum learning research, as we ask whether the adapted models themselves can help organize and select subsets of data that are representative of the whole. We provide preliminary results evaluating the method on image recognition and scientific named entity recognition (NER). We observe that the amount of data for HPO can be reduced by as much as 60% while still pointing to the same choice of hyperparameters as using the whole training set.
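To make the selection step concrete, the sketch below shows one way the described reduction could be implemented. It is an illustrative reading of the abstract, not the authors' exact procedure: the use of per-item loss as the difficulty score, the quantile-based strata, and all names (select_subset, keep_fraction, n_strata) are assumptions.

```python
# Illustrative sketch only: per-item loss as the difficulty proxy, the
# quantile strata, and every name here are assumptions made for the
# example, not the authors' exact procedure.
import numpy as np

def select_subset(difficulties, keep_fraction=0.4, n_strata=10, seed=0):
    """Keep roughly `keep_fraction` of the items while preserving the
    dataset's overall distribution of item difficulty.

    difficulties: one score per item, e.g. the adapted model's per-item loss.
    Returns indices of the retained items, ordered easy to hard.
    """
    rng = np.random.default_rng(seed)
    difficulties = np.asarray(difficulties)

    # Bin items into difficulty strata at quantile boundaries, so each
    # stratum covers an equal share of the difficulty distribution.
    edges = np.quantile(difficulties, np.linspace(0.0, 1.0, n_strata + 1))
    strata = np.clip(np.searchsorted(edges, difficulties, side="right") - 1,
                     0, n_strata - 1)

    # Sample the same fraction from every stratum, preserving the overall
    # proportions of easy, medium, and hard items in the subset.
    kept = []
    for s in range(n_strata):
        idx = np.flatnonzero(strata == s)
        if idx.size == 0:
            continue
        k = max(1, int(round(keep_fraction * idx.size)))
        kept.append(rng.choice(idx, size=k, replace=False))
    kept = np.concatenate(kept)

    # Order the retained items easy-to-hard, echoing the curriculum
    # learning rationale mentioned above.
    return kept[np.argsort(difficulties[kept])]
```

With keep_fraction=0.4, the sketch retains 40% of the items, matching the up-to-60% reduction reported above; the HPO search would then be run on the returned subset instead of the whole training set.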



Published In

CAIN '24: Proceedings of the IEEE/ACM 3rd International Conference on AI Engineering - Software Engineering for AI
April 2024
307 pages
ISBN: 9798400705915
DOI: 10.1145/3644815
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).

Publisher

Association for Computing Machinery

New York, NY, United States
