Abstract
Nonlinear machine-learning models have recently been applied to multimedia data with great effect, contributing to a wide range of downstream tasks. However, nonlinear models require large amounts of training data to fit their many parameters properly and achieve reasonable performance. Training on large datasets consumes substantial time and cost, both of which are limited resources in model development and deployment. The goal of our study is to construct a core set that approximates the entire original dataset, so that performance changes caused by model redesign or parameter changes can be observed quickly during machine-learning deployment. The core set consists mainly of informative samples that contribute strongly to training. We measure each sample's contribution based on dataset distillation and apply area-based sampling for generalization. Because the learning contribution is measured with only a small number of distilled images, the core set can be constructed in a short time. Experimental results show that our method selects more useful samples than random sampling.
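The abstract compresses the method into three steps: distill the dataset into a few synthetic images, score each original sample's training contribution against them, and sample across score regions to build the core set. The sketch below is a minimal illustration of that pipeline, not the authors' implementation: the nearest-prototype contribution measure, the quantile-bin reading of "area-based sampling", and all names (proxy_scores, area_based_sample, the toy data) are assumptions, since the abstract does not specify the actual scoring or sampling rules.

```python
import numpy as np

rng = np.random.default_rng(0)

def proxy_scores(X, y, X_distilled, y_distilled):
    # Distance from every sample to every distilled image: (n, m) matrix.
    d = np.linalg.norm(X[:, None, :] - X_distilled[None, :, :], axis=2)
    nearest = d.argmin(axis=1)          # index of the closest distilled image
    margin = d.min(axis=1)              # distance to it
    disagree = (y != y_distilled[nearest]).astype(float)
    # Higher score = farther from the distilled set and/or mislabeled by it,
    # i.e. a harder, more informative sample under this stand-in measure.
    return margin + disagree

def area_based_sample(scores, k, n_bins=10):
    # Stratified sampling over quantile bins of the scores, so the core set
    # covers the whole contribution distribution, not only the extremes.
    edges = np.quantile(scores, np.linspace(0.0, 1.0, n_bins + 1))
    picked = []
    per_bin = k // n_bins
    for lo, hi in zip(edges[:-1], edges[1:]):
        members = np.where((scores >= lo) & (scores <= hi))[0]
        take = min(per_bin, len(members))
        if take > 0:
            picked.extend(rng.choice(members, size=take, replace=False))
    return np.asarray(picked)

# Toy stand-in data: 1000 samples with 16 features, 2 classes,
# and 10 "distilled" prototypes taken directly from the data.
X = rng.normal(size=(1000, 16))
y = (X[:, 0] > 0).astype(int)
X_dist, y_dist = X[:10], y[:10]

scores = proxy_scores(X, y, X_dist, y_dist)
core_idx = area_based_sample(scores, k=100)
print(f"selected core set of {core_idx.size} samples")
```

Because scoring touches only the small distilled set rather than requiring full training runs, the selection itself stays cheap, which is the property the abstract emphasizes.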
Data availability
The datasets generated and analysed during the current study are available from the corresponding author on reasonable request.
Code availability
Not applicable.
Funding
This research was funded by Korea Institute of Science and Technology Information (KISTI).
Ethics declarations
Conflicts of interest/Competing interests
The authors declare no conflicts of interest or competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Jeong, Y., Hwang, M. & Sung, W. Training data selection based on dataset distillation for rapid deployment in machine-learning workflows. Multimed Tools Appl 82, 9855–9870 (2023). https://doi.org/10.1007/s11042-022-13701-6