Abstract
Large databases can be a source of useful knowledge, yet this knowledge is implicit in the data. It must be mined and expressed in a concise, useful form: statistical patterns, equations, rules, conceptual hierarchies, and the like. Automation of knowledge discovery is important because databases are growing in size and number, and standard data analysis techniques are not designed for exploring huge hypothesis spaces. We concentrate on the discovery of regularities, defining a regularity by a pattern and the range in which that pattern holds. We argue that two types of patterns are particularly important, contingency tables and equations, and we present Forty-Niner (49er), a general-purpose database mining system that conducts a large-scale search for those patterns in many subsets of data, conducting the more costly search for equations only when the data indicate a functional relationship. 49er can refine the initial regularities to yield stronger and more general regularities and more useful concepts. 49er combines several searches, each contributing to a different aspect of a regularity. The correspondence between the components of search and the structure of regularities makes the system easy to understand, use, and expand. Finally, we discuss 49er's performance in four categories of tests: (1) open exploration of new databases; (2) reproduction of human findings (limited because databases which have been extensively explored are very rare); (3) hide-and-seek testing on artificially created data, to evaluate 49er on a large scale against known results; (4) exploration of randomly generated databases.
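The two-stage idea in the abstract (a cheap contingency-table test first, an equation search only when the table suggests a functional relationship) can be illustrated with a minimal sketch. This is not 49er's implementation: the function names, the use of the chi-square statistic and Cramér's V as the strength measure, and the `tolerance` threshold in `looks_functional` are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def contingency_table(xs, ys):
    """Count co-occurrences of (x, y) value pairs for two attributes."""
    counts = Counter(zip(xs, ys))
    x_vals = sorted(set(xs))
    y_vals = sorted(set(ys))
    return [[counts[(x, y)] for y in y_vals] for x in x_vals]


def chi_square(table):
    """Pearson chi-square statistic against the independence hypothesis."""
    n = sum(map(sum, table))
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2


def cramers_v(table):
    """Strength of association in [0, 1], derived from chi-square."""
    n = sum(map(sum, table))
    k = min(len(table), len(table[0])) - 1
    return sqrt(chi_square(table) / (n * k)) if k > 0 else 0.0


def looks_functional(table, tolerance=0.9):
    """Hypothetical gate for the costlier equation search: for every x value,
    at least `tolerance` of its records must fall into a single y cell."""
    return all(max(row) >= tolerance * sum(row) for row in table)


# Toy data in which y is fully determined by x.
xs = [1, 1, 2, 2, 3, 3]
ys = [10, 10, 20, 20, 30, 30]
table = contingency_table(xs, ys)
if looks_functional(table):
    print("candidate for equation search, V =", cramers_v(table))
```

In a fuller system the same check would be repeated over many attribute pairs and many subsets of records, which is where the range of a regularity comes from; the sketch covers only a single pair on a single data set.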
Cite this article
Żytkow, J.M., Zembowicz, R. Database exploration in search of regularities. J Intell Inf Syst 2, 39–81 (1993). https://doi.org/10.1007/BF01066546
DOI: https://doi.org/10.1007/BF01066546