Is the UCI Repository Useful for Data Mining?

Soares, Carlos

doi:10.1007/978-3-540-24580-3_28

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2902)

Included in the following conference series:

Portuguese Conference on Artificial Intelligence

Abstract

We propose a methodology to investigate how relevant repositories of benchmark problems, such as the one commonly known as the UCI repository, are to the real world. The methodology compares the distribution of the relative performance of algorithms on data sets from a given repository with the corresponding distribution on data sets from the “real world”. If the distributions differ, the knowledge about the relative performance of algorithms obtained from the repository in question is mostly useless. In the case of the UCI repository, this would mean that a significant proportion of published results would be of little practical use. However, this is not what our results indicate. We also propose an adaptation of this method to test whether tool developers are “overfitting” repositories, which also yields negative results for the UCI repository.
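
The comparison described in the abstract can be illustrated with a minimal sketch. The abstract does not say which statistical test the paper uses, so the two-sample Kolmogorov-Smirnov statistic with a permutation p-value below is an assumed stand-in, not the author's procedure. The function names, the choice of "difference in error rate between two algorithms" as the measure of relative performance, and the toy numbers in the demo are all hypothetical and are not results from the paper.

```python
import random


def relative_performance(errors_a, errors_b):
    """Per-data-set difference in error between two algorithms.

    Assumption: relative performance is measured as a simple difference;
    the paper may define it differently.
    """
    return [a - b for a, b in zip(errors_a, errors_b)]


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic.

    This is the largest gap between the two empirical CDFs, evaluated
    at every observed value.
    """
    def ecdf(sample, x):
        return sum(v <= x for v in sample) / len(sample)

    points = sorted(set(sample_a) | set(sample_b))
    return max(abs(ecdf(sample_a, x) - ecdf(sample_b, x)) for x in points)


def permutation_p_value(sample_a, sample_b, n_permutations=2000, seed=0):
    """Estimate how often a random relabelling gives a gap at least as large."""
    rng = random.Random(seed)
    observed = ks_statistic(sample_a, sample_b)
    pooled = list(sample_a) + list(sample_b)
    n_a = len(sample_a)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        if ks_statistic(pooled[:n_a], pooled[n_a:]) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)


if __name__ == "__main__":
    # Hypothetical toy error rates, for illustration only.
    repo = relative_performance([0.12, 0.30, 0.25, 0.08, 0.19],
                                [0.10, 0.33, 0.20, 0.09, 0.21])
    real = relative_performance([0.22, 0.15, 0.28, 0.11],
                                [0.20, 0.18, 0.25, 0.12])
    print("KS statistic:", ks_statistic(repo, real))
    print("p-value:", permutation_p_value(repo, real))
```

Under this sketch, a small p-value would suggest that relative performance is distributed differently in the repository than in the real-world collection, which is the situation the abstract describes as making repository results "mostly useless". A large p-value gives no evidence of such a difference.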

Author information

Authors and Affiliations

LIACC/Fac. of Economics, University of Porto, Rua do Campo Alegre, 823, 4150-180, Porto, Portugal
Carlos Soares

Editor information

Editors and Affiliations

Departamento de Informática, Universidade de Évora, Rua Romão Ramalho, 59, 7000, Évora, Portugal
Fernando Moura Pires
Universidade de Évora and CENTRIA FCT/UNL, Portugal
Salvador Abreu

About this paper

Cite this paper

Soares, C. (2003). Is the UCI Repository Useful for Data Mining? In: Pires, F.M., Abreu, S. (eds) Progress in Artificial Intelligence. EPIA 2003. Lecture Notes in Computer Science, vol 2902. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24580-3_28

DOI: https://doi.org/10.1007/978-3-540-24580-3_28
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-20589-0
Online ISBN: 978-3-540-24580-3
eBook Packages: Springer Book Archive