Efficient Sampling and Handling of Variance in Tuning Data Mining Models

  • Conference paper
Parallel Problem Solving from Nature - PPSN XII (PPSN 2012)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 7491)

Abstract

Computational Intelligence (CI) provides robust, well-performing solutions for global optimization. CI is especially suited to difficult parameter-optimization tasks in which the fitness function is noisy. Such fitness landscapes frequently arise in real-world applications like Data Mining (DM). Unfortunately, parameter tuning in DM is computationally expensive, and CI-based methods often require many function evaluations before they converge to good solutions. Earlier studies have shown that surrogate models can reduce the number of real function evaluations; however, each remaining function evaluation is still time-consuming. In this paper we investigate whether and how the fitness landscape of the parameter space changes when fewer observations are used for model training during tuning. A representative study on seven DM tasks shows that the results remain competitive: on all of these tasks, a fraction of 10-15% of the training data is sufficient, which reduces the computation time by a factor of 6-10.
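To make the idea in the abstract concrete, the following is a minimal sketch (not the authors' code) of tuning on a subsample: candidate parameter settings are evaluated on a small random fraction of the training data, and only the best configuration is refit on the full set. The SVM learner, the simple candidate grid standing in for a tuner, and the 15% subsample ratio are illustrative assumptions.

```python
# Sketch: evaluate tuning candidates on a subsample, refit the winner on all data.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for a DM task.
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

# Keep only a fraction of the training data for the (expensive) tuning loop.
subsample_frac = 0.15  # assumed ratio, in the 10-15% range reported in the paper
X_sub, _, y_sub, _ = train_test_split(
    X, y, train_size=subsample_frac, stratify=y, random_state=0)

# Candidate hyperparameters; in practice a tuner (e.g. SPO, CMA-ES) proposes these,
# here a small grid stands in for that loop.
candidates = [{"C": c, "gamma": g}
              for c in (0.1, 1.0, 10.0)
              for g in (0.01, 0.1, 1.0)]

def fitness(params, X_train, y_train):
    """Noisy fitness: mean cross-validated accuracy of an SVM on the given data."""
    model = SVC(kernel="rbf", **params)
    return cross_val_score(model, X_train, y_train, cv=3).mean()

# Tuning is done on the subsample only, which is what cuts the runtime.
best = max(candidates, key=lambda p: fitness(p, X_sub, y_sub))

# The final model is trained once on all available data.
final_model = SVC(kernel="rbf", **best).fit(X, y)
print("best parameters found on the subsample:", best)
```

The design point is that only the repeated fitness evaluations inside the tuning loop use the reduced data set; the single final training run still sees every observation.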

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Koch, P., Konen, W. (2012). Efficient Sampling and Handling of Variance in Tuning Data Mining Models. In: Coello, C.A.C., Cutello, V., Deb, K., Forrest, S., Nicosia, G., Pavone, M. (eds) Parallel Problem Solving from Nature - PPSN XII. PPSN 2012. Lecture Notes in Computer Science, vol 7491. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-32937-1_20

  • DOI: https://doi.org/10.1007/978-3-642-32937-1_20

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-32936-4

  • Online ISBN: 978-3-642-32937-1

  • eBook Packages: Computer Science, Computer Science (R0)
