Abstract
We present a data mining method that integrates discretization, generalization, and rough set feature selection, reducing the data both horizontally and vertically. In the first phase, discretization and generalization are integrated: numeric attributes are discretized into a small number of intervals, the primitive values of symbolic attributes are replaced by higher-level concepts, and obviously superfluous or irrelevant symbolic attributes are eliminated. Horizontal reduction is achieved by merging tuples that become identical after an attribute value is replaced by its higher-level value in a predefined concept hierarchy (for symbolic attributes) or after discretization (for continuous, i.e., numeric, attributes). This phase greatly decreases the number of tuples considered in subsequent processing. In the second phase, a novel context-sensitive feature merit measure is used to rank the features, and a subset of relevant attributes is chosen based on rough set theory and the features' merit values. A reduced table is obtained by removing the attributes outside this subset, so the data set is further reduced vertically without changing the interdependence relationships between the classes and the attributes. Finally, the tuples in the reduced relation are transformed into different kinds of knowledge rules by different knowledge discovery algorithms. Based on these principles, a prototype knowledge discovery system, DBROUGH-II, has been constructed by integrating discretization, generalization, rough set feature selection, and a variety of data mining algorithms. Tests on a telecommunication customer data warehouse demonstrate that different kinds of knowledge rules, such as characteristic rules, discriminant rules, maximal generalized classification rules, and data evolution regularities, can be discovered efficiently and effectively.
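The horizontal-reduction step described above can be illustrated with a minimal sketch: symbolic values are replaced by their higher-level concepts from a concept hierarchy, numeric values are discretized into intervals, and tuples that become identical are merged. The hierarchy, cut points, attribute names, and sample tuples below are illustrative assumptions, not data from the paper.

```python
# A minimal sketch of first-phase horizontal reduction (illustrative only):
# generalize symbolic values, discretize a numeric attribute, merge duplicates.
from collections import Counter

# Hypothetical one-level concept hierarchy for a symbolic attribute.
CONCEPT_HIERARCHY = {"sparrow": "bird", "eagle": "bird", "trout": "fish"}

# Hypothetical cut points discretizing a numeric attribute into 3 intervals.
CUT_POINTS = [10, 100]

def discretize(x):
    """Map a numeric value to the label of the interval it falls in."""
    for i, cut in enumerate(CUT_POINTS):
        if x < cut:
            return f"interval_{i}"
    return f"interval_{len(CUT_POINTS)}"

def reduce_horizontally(tuples):
    """Replace primitives with higher-level concepts, discretize numerics,
    then merge identical tuples, keeping an occurrence count per tuple."""
    merged = Counter()
    for animal, weight, cls in tuples:
        key = (CONCEPT_HIERARCHY.get(animal, animal), discretize(weight), cls)
        merged[key] += 1
    return merged

data = [("sparrow", 5, "small"), ("eagle", 50, "medium"),
        ("trout", 5, "small"), ("sparrow", 7, "small")]
reduced = reduce_horizontally(data)
# ("bird", "interval_0", "small") now covers two of the four original tuples.
```

Keeping a count per merged tuple preserves how many original tuples each generalized tuple represents, which later rule-generation steps can use as a vote or support measure.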
Cite this article
Hu, X., Cercone, N. Data Mining via Discretization, Generalization and Rough Set Feature Selection. Knowledge and Information Systems 1, 33–60 (1999). https://doi.org/10.1007/BF03325090