Abstract
Nonlinear machine-learning models have recently been applied to multimedia data with great effect, contributing to a wide range of downstream tasks. However, nonlinear models require large amounts of training data to fit their many parameters properly and achieve reasonable performance. Training on large datasets consumes substantial time and cost, both of which are limited resources in model development and deployment. The goal of our study is to construct a core set that approximates the entire original dataset, so that performance changes caused by model redesign or parameter changes can be observed quickly during machine-learning deployment. The core set consists mainly of informative samples that contribute strongly to training. We measure each sample's contribution based on dataset distillation and apply area-based sampling for generalization. Because the learning contribution is measured with only a small number of distilled images, the core set can be constructed in a short time. Experimental results show that our method selects more useful samples than random sampling.
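The abstract compresses the method into three steps: distill the dataset into a few synthetic images, score each original sample's training contribution against them, and sample across score regions to build the core set. The sketch below is a minimal illustration of that pipeline, not the authors' implementation: the nearest-prototype contribution measure, the quantile-bin reading of "area-based sampling", and all names (proxy_scores, area_based_sample, the toy data) are assumptions, since the abstract does not specify the actual scoring or sampling rules.

```python
import numpy as np

rng = np.random.default_rng(0)

def proxy_scores(X, y, X_distilled, y_distilled):
    # Distance from every sample to every distilled image: (n, m) matrix.
    d = np.linalg.norm(X[:, None, :] - X_distilled[None, :, :], axis=2)
    nearest = d.argmin(axis=1)          # index of the closest distilled image
    margin = d.min(axis=1)              # distance to it
    disagree = (y != y_distilled[nearest]).astype(float)
    # Higher score = farther from the distilled set and/or mislabeled by it,
    # i.e. a harder, more informative sample under this stand-in measure.
    return margin + disagree

def area_based_sample(scores, k, n_bins=10):
    # Stratified sampling over quantile bins of the scores, so the core set
    # covers the whole contribution distribution, not only the extremes.
    edges = np.quantile(scores, np.linspace(0.0, 1.0, n_bins + 1))
    picked = []
    per_bin = k // n_bins
    for lo, hi in zip(edges[:-1], edges[1:]):
        members = np.where((scores >= lo) & (scores <= hi))[0]
        take = min(per_bin, len(members))
        if take > 0:
            picked.extend(rng.choice(members, size=take, replace=False))
    return np.asarray(picked)

# Toy stand-in data: 1000 samples with 16 features, 2 classes,
# and 10 "distilled" prototypes taken directly from the data.
X = rng.normal(size=(1000, 16))
y = (X[:, 0] > 0).astype(int)
X_dist, y_dist = X[:10], y[:10]

scores = proxy_scores(X, y, X_dist, y_dist)
core_idx = area_based_sample(scores, k=100)
print(f"selected core set of {core_idx.size} samples")
```

Because scoring touches only the small distilled set rather than requiring full training runs, the selection itself stays cheap, which is the property the abstract emphasizes.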
Data availability
The datasets generated and analysed during the current study are available from the corresponding author on reasonable request.
Code availability
Not applicable.
Funding
This research was funded by Korea Institute of Science and Technology Information (KISTI).
Ethics declarations
Conflicts of interest/Competing interests
The authors declare no conflicts of interest or competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Jeong, Y., Hwang, M. & Sung, W. Training data selection based on dataset distillation for rapid deployment in machine-learning workflows. Multimed Tools Appl 82, 9855–9870 (2023). https://doi.org/10.1007/s11042-022-13701-6