
Training data selection based on dataset distillation for rapid deployment in machine-learning workflows

Published in Multimedia Tools and Applications, collection: Intelligent Multimedia Data Analytics and Computing

Abstract

Recently, nonlinear machine-learning models have been applied effectively to multimedia data, contributing greatly to various downstream tasks. However, such models require large amounts of training data to fit their many parameters and reach reasonable performance. Training on large datasets substantially increases time and cost, both of which are limited resources in model development and deployment. The goal of our study is to construct a core set that approximates the entire original dataset, so that performance changes caused by model redesign or parameter changes can be observed quickly during machine-learning deployment. The core set consists mainly of informative samples that contribute strongly to training. We measure each sample's contribution based on dataset distillation and perform area-based sampling for generalization. Because the contribution is measured with only a small number of distilled images, the core set can be constructed in a short time. Experimental results show that our method selects more useful samples than random sampling.
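The abstract only outlines the approach, so the following Python/PyTorch sketch is purely illustrative and rests on assumptions not stated by the authors: a proxy model trained on a handful of distilled images, per-sample loss under that proxy as a stand-in for the "contribution to training", and quantile bins over the scores as the "areas" used for sampling. Names such as select_core_set, distilled_loader, and budget are hypothetical, not the authors' implementation.

    import numpy as np
    import torch
    import torch.nn.functional as F
    from torch.utils.data import DataLoader

    def select_core_set(proxy_model, distilled_loader, full_dataset, budget,
                        score_bins=10, proxy_epochs=10, device="cpu"):
        """Pick `budget` training indices spread across contribution-score bins."""
        proxy_model.to(device)
        optimizer = torch.optim.SGD(proxy_model.parameters(), lr=0.01)

        # 1) Quickly fit the proxy model on the small distilled set only.
        proxy_model.train()
        for _ in range(proxy_epochs):
            for x, y in distilled_loader:
                optimizer.zero_grad()
                loss = F.cross_entropy(proxy_model(x.to(device)), y.to(device))
                loss.backward()
                optimizer.step()

        # 2) Score every original sample; here the per-sample loss under the
        #    proxy model approximates the sample's contribution to training.
        proxy_model.eval()
        per_sample_scores = []
        with torch.no_grad():
            for x, y in DataLoader(full_dataset, batch_size=256, shuffle=False):
                loss = F.cross_entropy(proxy_model(x.to(device)), y.to(device),
                                       reduction="none")
                per_sample_scores.append(loss.cpu())
        scores = torch.cat(per_sample_scores).numpy()

        # 3) Area-based sampling: partition the score range into quantile bins
        #    and draw an equal share from each bin, so the core set covers easy
        #    and hard regions rather than only the highest-scoring samples.
        edges = np.quantile(scores, np.linspace(0.0, 1.0, score_bins + 1))
        bin_ids = np.digitize(scores, edges[1:-1])   # bin index in 0..score_bins-1
        per_bin = budget // score_bins
        rng = np.random.default_rng(0)
        selected = []
        for b in range(score_bins):
            candidates = np.where(bin_ids == b)[0]
            take = min(per_bin, len(candidates))
            selected.extend(rng.choice(candidates, size=take, replace=False))
        return sorted(int(i) for i in selected)

Drawing from every score bin, rather than taking only the top-scoring samples, reflects the abstract's point that sampling across areas is done for generalization; the specific scoring and binning choices above are assumptions for illustration.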


Data availability

The datasets generated and analysed during the current study are available from the corresponding author on reasonable request.

Code availability

Not applicable.


Funding

This research was funded by Korea Institute of Science and Technology Information (KISTI).

Author information

Corresponding author

Correspondence to Myunggwon Hwang.

Ethics declarations

Conflicts of interest/Competing interests

The authors declare no conflicts of interest or competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Jeong, Y., Hwang, M. & Sung, W. Training data selection based on dataset distillation for rapid deployment in machine-learning workflows. Multimed Tools Appl 82, 9855–9870 (2023). https://doi.org/10.1007/s11042-022-13701-6

