Where are the large and difficult datasets?

  • Regular Article
Advances in Data Analysis and Classification

Abstract

A great many comparative performance assessments of classification rules have been undertaken, ranging from small ones involving just one or two methods, to large ones involving many tens of methods. We are undertaking a meta-analytic study of these studies, attempting to distil some overall conclusions. This paper describes just one of our observations. The dataset analysed in this paper contains 5,203 error rates taken from 45 articles and describing 146 datasets. One curious general relationship which was persistent in our data, despite the fact that we were looking at results mixed between distributions rather than conditional on distributions, was that error rate decreased with increasing dataset size. We believe this to be an artefact of the way datasets are collected by the research community.
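The kind of pooled relationship described in the abstract can be illustrated with a rank correlation between dataset size and reported error rate. The abstract does not say how the authors measured the relationship, so the following is only a minimal sketch of one common way to check for a monotone trend. The function names `ranks` and `spearman` are hypothetical helpers, and the toy values are invented for illustration; they are not taken from the paper's 5,203 error rates.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        # Extend the block [pos, end] over all entries tied with order[pos].
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = avg_rank
        pos = end + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: the Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Invented toy values (assumption, NOT data from the paper):
# one (dataset size, error rate) pair per reported result, pooled over datasets.
toy_sizes = [150, 500, 1000, 5000, 20000, 60000]
toy_error_rates = [0.28, 0.22, 0.25, 0.12, 0.09, 0.05]

rho = spearman(toy_sizes, toy_error_rates)
print(f"Spearman rho between size and error rate: {rho:.3f}")
```

A clearly negative `rho` on pooled results would match the direction of the relationship the abstract reports. On its own it would not distinguish a genuine effect from a selection artefact in how datasets are collected, which is the explanation the authors favour.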


Author information

Correspondence to Adrien Jamain.

About this article

Cite this article

Jamain, A., Hand, D.J. Where are the large and difficult datasets? Adv Data Anal Classif 3, 25–38 (2009). https://doi.org/10.1007/s11634-009-0037-8

  • DOI: https://doi.org/10.1007/s11634-009-0037-8