Abstract
Databases store large amounts of information about consumer transactions and other kinds of transactions. This information can be used to deduce rules about consumer behavior, and the rules can in turn be used to determine company policies in areas such as production and marketing. Since databases typically store millions of records, and each record may have 100 or more attributes, a necessary first step is to reduce the size of the database by eliminating attributes that influence the decision very little or not at all. In this paper we present techniques for exact and approximate reduction in a database system. These techniques can be implemented efficiently using SQL (Structured Query Language) commands. We tested and validated them on a real data set. The results showed that classification performance actually improved with the reduced set of attributes compared with the case in which all attributes were present. We also discuss how our techniques differ from statistical methods and from other data reduction methods such as rough sets.
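The abstract does not spell out the techniques themselves, so the following is only a minimal sketch of what an exact reduction test expressed in SQL might look like, not the paper's method. It assumes one common criterion: an attribute can be dropped if no two records that agree on all remaining attributes disagree on the decision. The table `purchases`, its columns, the sample rows, and the helper `is_redundant` are all hypothetical, and Python's built-in `sqlite3` module stands in for the database system.

```python
import sqlite3


def is_redundant(conn, table, attrs, decision, candidate):
    """Return True if dropping `candidate` keeps the table consistent, i.e.
    no group of rows that agree on the remaining attributes contains more
    than one distinct decision value.

    Assumption: table and column names are trusted identifiers (they are
    interpolated into the SQL text), and `attrs` has at least two entries.
    """
    remaining = [a for a in attrs if a != candidate]
    if not remaining:
        raise ValueError("need at least one attribute besides the candidate")
    group_cols = ", ".join(remaining)
    # Count the groups that become ambiguous once `candidate` is ignored.
    query = (
        f"SELECT COUNT(*) FROM ("
        f"SELECT {group_cols} FROM {table} "
        f"GROUP BY {group_cols} "
        f"HAVING COUNT(DISTINCT {decision}) > 1)"
    )
    return conn.execute(query).fetchone()[0] == 0


# Hypothetical toy data: here the decision `buys` depends only on `income`.
rows = [
    ("young", "low", "north", "no"),
    ("young", "high", "north", "yes"),
    ("old", "low", "south", "no"),
    ("old", "high", "south", "yes"),
    ("young", "low", "south", "no"),
    ("old", "high", "north", "yes"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE purchases (age_group TEXT, income TEXT, region TEXT, buys TEXT)"
)
conn.executemany("INSERT INTO purchases VALUES (?, ?, ?, ?)", rows)

attrs = ["age_group", "income", "region"]
for attr in attrs:
    print(attr, is_redundant(conn, "purchases", attrs, "buys", attr))
```

In this toy table, `region` and `age_group` can each be dropped on their own without creating conflicting records, while `income` cannot. An approximate variant could tolerate a small fraction of conflicting groups instead of requiring zero, but the threshold and the exact measure would be further assumptions beyond what the abstract states.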
Kumar, A. New Techniques for Data Reduction in a Database System for Knowledge Discovery Applications. Journal of Intelligent Information Systems 10, 31–48 (1998). https://doi.org/10.1023/A:1008633406999
DOI: https://doi.org/10.1023/A:1008633406999