Is the UCI Repository Useful for Data Mining?

  • Conference paper
Progress in Artificial Intelligence (EPIA 2003)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2902)

Abstract

We propose a methodology to investigate how relevant repositories of benchmark problems, such as the one commonly known as the UCI repository, are to the real world. The methodology compares the distribution of the relative performance of algorithms on data sets from a given repository with the corresponding distribution on data sets from the “real world”. If the two distributions differ, the knowledge about the relative performance of algorithms obtained from the repository in question is mostly useless. In the case of the UCI repository, this would mean that a significant proportion of published results would be of little practical use. However, this is not what our results indicate. We also propose an adaptation of this method to test whether tool developers are “overfitting” repositories, which also yields negative results for the UCI repository.
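The comparison of distributions described in the abstract can be illustrated with a minimal sketch. The abstract does not say which statistical test the paper uses, so the choices here are assumptions made only for illustration: relative performance is taken to be the accuracy difference between two algorithms on each data set, the two samples are compared with a two-sample Kolmogorov-Smirnov statistic, and significance is estimated with a permutation test. The function names and all numbers are hypothetical and are not results from the paper.

```python
import random


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the empirical CDFs."""
    points = sorted(set(a) | set(b))

    def ecdf(sample, x):
        # Fraction of the sample that is less than or equal to x.
        return sum(1 for v in sample if v <= x) / len(sample)

    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in points)


def permutation_p_value(a, b, n_rounds=2000, seed=0):
    """Estimate how often a random relabelling gives a gap at least as large as the observed one."""
    rng = random.Random(seed)
    observed = ks_statistic(a, b)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_rounds):
        rng.shuffle(pooled)
        if ks_statistic(pooled[:len(a)], pooled[len(a):]) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_rounds + 1)


# Made-up accuracy differences (algorithm A minus algorithm B), one value per data set.
repository_diffs = [0.02, -0.01, 0.03, 0.00, 0.01, -0.02, 0.04, 0.02]
real_world_diffs = [0.01, 0.00, 0.02, -0.01, 0.03, 0.02, -0.02]

d = ks_statistic(repository_diffs, real_world_diffs)
p = permutation_p_value(repository_diffs, real_world_diffs)
print(f"KS statistic = {d:.3f}, permutation p-value = {p:.3f}")
```

Under these assumptions, a small p-value would suggest that the repository and the real-world data sets give different pictures of relative performance, while a large one would give no evidence of a difference. A permutation test is used in the sketch because it makes no distributional assumptions and needs only the standard library.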

Copyright information

© 2003 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Soares, C. (2003). Is the UCI Repository Useful for Data Mining? In: Pires, F.M., Abreu, S. (eds) Progress in Artificial Intelligence. EPIA 2003. Lecture Notes in Computer Science, vol 2902. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24580-3_28

  • DOI: https://doi.org/10.1007/978-3-540-24580-3_28

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-20589-0

  • Online ISBN: 978-3-540-24580-3

  • eBook Packages: Springer Book Archive
