Poster
DOI: 10.1145/3644815.3644978

Data Selection Driven by Item Difficulty: On Investigating Data Efficient Practice for Hyperparameter Search

Published: 11 June 2024

Abstract

Foundation models shift interest toward adapting existing models rather than building proprietary models from scratch. Despite this shift, hyperparameter optimization (HPO) is still required. Users adapting systems powered by these models on proprietary data should not substantially increase the overall resource footprint through extensive hyperparameter search. Since this footprint is also proportional to the amount of data used in HPO, we investigate how a user can effectively reduce that amount, leveraging the deep learning model's empirically observed ability to produce the expected correct output for an item in the dataset.
In this work, we describe a methodology that achieves this data reduction by estimating a measure of each item's difficulty. The method keeps only a portion of the data that preserves the overall proportions of item difficulty across the dataset, while also ordering items meaningfully. The rationale derives from curriculum learning research, as we ask whether the adapted models themselves can help organize and select subsets of data that are representative of the whole. We provide preliminary results evaluating the method on image recognition and scientific named entity recognition (NER). We observe that the amount of data for HPO can be reduced by as much as 60% while still pointing to the same choice of hyperparameters as using the whole training set.
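To make the selection step concrete, the sketch below shows one way the described reduction could be implemented. It is an illustrative reading of the abstract, not the authors' exact procedure: the use of per-item loss as the difficulty score, the quantile-based strata, and all names (select_subset, keep_fraction, n_strata) are assumptions.

```python
# Illustrative sketch only: per-item loss as the difficulty proxy, the
# quantile strata, and every name here are assumptions made for the
# example, not the authors' exact procedure.
import numpy as np

def select_subset(difficulties, keep_fraction=0.4, n_strata=10, seed=0):
    """Keep roughly `keep_fraction` of the items while preserving the
    dataset's overall distribution of item difficulty.

    difficulties: one score per item, e.g. the adapted model's per-item loss.
    Returns indices of the retained items, ordered easy to hard.
    """
    rng = np.random.default_rng(seed)
    difficulties = np.asarray(difficulties)

    # Bin items into difficulty strata at quantile boundaries, so each
    # stratum covers an equal share of the difficulty distribution.
    edges = np.quantile(difficulties, np.linspace(0.0, 1.0, n_strata + 1))
    strata = np.clip(np.searchsorted(edges, difficulties, side="right") - 1,
                     0, n_strata - 1)

    # Sample the same fraction from every stratum, preserving the overall
    # proportions of easy, medium, and hard items in the subset.
    kept = []
    for s in range(n_strata):
        idx = np.flatnonzero(strata == s)
        if idx.size == 0:
            continue
        k = max(1, int(round(keep_fraction * idx.size)))
        kept.append(rng.choice(idx, size=k, replace=False))
    kept = np.concatenate(kept)

    # Order the retained items easy-to-hard, echoing the curriculum
    # learning rationale mentioned above.
    return kept[np.argsort(difficulties[kept])]
```

With keep_fraction=0.4, the sketch retains 40% of the items, matching the up-to-60% reduction reported above; the HPO search would then be run on the returned subset instead of the whole training set.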



Published In

CAIN '24: Proceedings of the IEEE/ACM 3rd International Conference on AI Engineering - Software Engineering for AI
April 2024
307 pages
ISBN: 9798400705915
DOI: 10.1145/3644815
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).

Publisher

Association for Computing Machinery

New York, NY, United States
