
Fast and simple dataset selection for machine learning

Timm J. Peter and Oliver Nelles

Abstract

The task of data reduction is discussed and a novel selection approach is proposed that allows the point distribution of the selected data subset to be controlled. The proposed approach utilizes the estimation of probability density functions (pdfs). Due to its structure, the new method is capable of selecting a subset either by approximating the pdf of the original dataset or by approximating an arbitrary, desired target pdf. The new strategy evaluates the estimated pdfs solely at the selected data points, resulting in a simple and efficient algorithm with low computational and memory demand. The performance of the new approach is investigated in two different scenarios. For representative subset selection of a dataset, the new approach is compared to a recently proposed, more complex method and shows comparable results. To demonstrate the capability of matching a target pdf, a uniform distribution is chosen as an example. Here the new method is compared to strategies for space-filling design of experiments and shows convincing results.
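To make the approach concrete, the following is a minimal sketch of how such a pdf-based greedy selection could look. It is an illustration under assumptions, not the authors' algorithm: the Gaussian kernel density estimate, the squared-error matching criterion, and all names (gaussian_kde, select_subset, target_pdf) are invented for this sketch. Its one faithful ingredient is taken from the abstract: the estimated pdfs are evaluated only at the selected points themselves, so no evaluation grid and no pass over the full dataset is needed per candidate.

import numpy as np

# Sketch only: greedily grow a subset S of X so that a kernel density
# estimate (KDE) built from S, evaluated solely at the points of S,
# approaches a desired target pdf. NOT the authors' exact algorithm.

def gaussian_kde(points, queries, bandwidth):
    """Gaussian KDE of `points` (n, d), evaluated at `queries` (q, d)."""
    d = points.shape[1]
    sq = np.sum((queries[:, None, :] - points[None, :, :]) ** 2, axis=2)
    norm = (2.0 * np.pi * bandwidth ** 2) ** (d / 2)
    return np.exp(-0.5 * sq / bandwidth ** 2).sum(axis=1) / (len(points) * norm)

def select_subset(X, n_select, target_pdf, bandwidth=0.1, seed=0):
    """Greedily pick n_select rows of X whose KDE matches target_pdf.

    target_pdf: callable mapping an (n, d) array to pdf values, e.g. a
    density estimate of X itself (representative selection) or a
    constant (uniform target, as in space-filling designs).
    """
    rng = np.random.default_rng(seed)
    selected = [int(rng.integers(len(X)))]          # random starting point
    candidates = set(range(len(X))) - set(selected)
    for _ in range(n_select - 1):
        best, best_err = None, np.inf
        for c in candidates:
            trial = X[selected + [c]]
            # Key idea from the abstract: evaluate the subset KDE only
            # at the selected points themselves -- no grid is required.
            err = np.mean((gaussian_kde(trial, trial, bandwidth)
                           - target_pdf(trial)) ** 2)
            if err < best_err:
                best, best_err = c, err
        selected.append(best)
        candidates.remove(best)
    return np.array(selected)

# Example: thin 500 Gaussian-distributed points down to 40 whose spread
# is pushed toward a uniform density on [0, 1]^2 (target pdf == 1 there).
X = np.random.default_rng(1).normal(0.5, 0.15, size=(500, 2))
idx = select_subset(X, 40, target_pdf=lambda pts: np.ones(len(pts)))
print(X[idx].std(axis=0), X.std(axis=0))   # selected vs. original spread

For representative selection, target_pdf would instead be a density estimate of the whole dataset; the uniform target above mimics the space-filling comparison described in the abstract.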

About the authors

Timm J. Peter

Timm J. Peter graduated with a Master of Science degree from Universität Siegen in 2018. After finishing his master’s thesis on regularized FIR models, he joined the working group Automatic Control – Mechatronics of Prof. Nelles as a research assistant. His research focuses on new techniques for linear and nonlinear system identification.

Oliver Nelles

Oliver Nelles is Professor at the University of Siegen in the Department of Mechanical Engineering and chair of Automatic Control – Mechatronics. He received his doctorate from the Technical University of Darmstadt in 1999. His key research topics are nonlinear system identification, dynamics representations, design of experiments, metamodeling, and local model networks.

Received: 2019-01-31
Accepted: 2019-06-26
Published Online: 2019-09-27
Published in Print: 2019-10-25

© 2019 Walter de Gruyter GmbH, Berlin/Boston
