Abstract
We propose a methodology to investigate how relevant repositories of benchmark problems, such as the one commonly known as the UCI repository, are to the real world. The methodology compares the distribution of the relative performance of algorithms on data sets from a given repository with the corresponding distribution on data sets from the “real world”. If the two distributions differ, knowledge about the relative performance of algorithms obtained from the repository in question is mostly useless. In the case of the UCI repository, this would mean that a significant proportion of published results would be of little practical use. However, this is not what our results indicate. We also propose an adaptation of this method to test whether tool developers are “overfitting” repositories, which also yields negative results for the UCI repository.
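The comparison described above can be illustrated with a minimal sketch. The abstract does not say how relative performance is measured or which statistical test is used, so everything here is an assumption made for illustration: relative performance is taken as the ratio of the error rates of two algorithms on each data set, and the two samples are compared with a permutation test on the two-sample Kolmogorov-Smirnov statistic. The function names are hypothetical and are not taken from the paper.

```python
import random


def relative_performance(err_a, err_b):
    """Assumed measure: error of algorithm A divided by error of algorithm B,
    computed per data set (errors of B must be non-zero)."""
    return [a / b for a, b in zip(err_a, err_b)]


def ks_statistic(x, y):
    """Two-sample Kolmogorov-Smirnov statistic: the largest absolute gap
    between the empirical distribution functions of x and y."""
    d = 0.0
    for v in list(x) + list(y):
        fx = sum(1 for t in x if t <= v) / len(x)
        fy = sum(1 for t in y if t <= v) / len(y)
        d = max(d, abs(fx - fy))
    return d


def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Estimate how often a random relabelling of the pooled values gives a
    KS statistic at least as large as the observed one."""
    rng = random.Random(seed)
    observed = ks_statistic(x, y)
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if ks_statistic(pooled[:len(x)], pooled[len(x):]) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)
```

To use the sketch, one would compute `relative_performance` once for the repository data sets and once for the real-world data sets, then pass both lists to `permutation_p_value`. A small p-value would suggest that the two distributions differ, which is the situation the abstract describes as making repository results mostly useless.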
Copyright information
© 2003 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Soares, C. (2003). Is the UCI Repository Useful for Data Mining? In: Pires, F.M., Abreu, S. (eds) Progress in Artificial Intelligence. EPIA 2003. Lecture Notes in Computer Science, vol 2902. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24580-3_28
DOI: https://doi.org/10.1007/978-3-540-24580-3_28
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-20589-0
Online ISBN: 978-3-540-24580-3
eBook Packages: Springer Book Archive