Data Mining via Discretization, Generalization and Rough Set Feature Selection

  • Regular Paper
  • Knowledge and Information Systems

Abstract

We present a data mining method that integrates discretization, generalization and rough set feature selection. The method reduces the data both horizontally and vertically. In the first phase, discretization and generalization are integrated: numeric attributes are discretized into a few intervals, the primitive values of symbolic attributes are replaced by high-level concepts, and obviously superfluous or irrelevant symbolic attributes are eliminated. Horizontal reduction is achieved by merging identical tuples after each symbolic attribute value has been replaced by its higher-level value in a predefined concept hierarchy, or after continuous (numeric) attributes have been discretized. This phase greatly decreases the number of tuples that must be considered further in the database(s). In the second phase, a novel context-sensitive feature merit measure is used to rank features, and a subset of relevant attributes is chosen based on rough set theory and the merit values of the features. A reduced table is obtained by removing the attributes that are not in the relevant subset, so the data set is further reduced vertically without changing the interdependence relationships between the classes and the attributes. Finally, the tuples in the reduced relation are transformed into different kinds of knowledge rules by different knowledge discovery algorithms. Based on these principles, a prototype knowledge discovery system, DBROUGH-II, has been constructed by integrating discretization, generalization, rough set feature selection and a variety of data mining algorithms. Tests on a telecommunication customer data warehouse demonstrate that different kinds of knowledge rules, such as characteristic rules, discriminant rules, maximal generalized classification rules, and data evolution regularities, can be discovered efficiently and effectively.
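The two reduction phases described in the abstract can be illustrated with a minimal Python sketch. It is not the DBROUGH-II implementation: the concept hierarchy, cut points, attribute names and sample tuples are all hypothetical, and because the abstract does not define the context-sensitive merit measure, the vertical step substitutes a plain rough set dependency check (an attribute is dropped when the dependency degree of the decision attribute is unchanged without it).

```python
from collections import Counter, defaultdict

# Hypothetical concept hierarchy and cut points (assumptions, not from the paper).
HIERARCHY = {"city": {"Regina": "Saskatchewan", "Saskatoon": "Saskatchewan",
                      "Calgary": "Alberta"}}
CUTS = {"monthly_bill": [50.0, 100.0]}  # yields interval indices 0, 1, 2


def discretize(value, cuts):
    """Return the index of the interval that value falls into."""
    return sum(value >= c for c in cuts)


def reduce_horizontally(rows, attrs):
    """Generalize or discretize each tuple, then merge identical tuples with a count."""
    merged = Counter()
    for row in rows:
        out = []
        for a in attrs:
            v = row[a]
            if a in CUTS:
                v = discretize(v, CUTS[a])
            elif a in HIERARCHY:
                v = HIERARCHY[a].get(v, v)
            out.append(v)
        merged[tuple(out)] += 1
    return merged


def dependency(table, cond_idx, dec_idx):
    """Rough set dependency degree: share of tuples in the positive region."""
    classes = defaultdict(lambda: [set(), 0])
    for tup, count in table.items():
        key = tuple(tup[i] for i in cond_idx)
        classes[key][0].add(tup[dec_idx])
        classes[key][1] += count
    total = sum(table.values())
    consistent = sum(n for decisions, n in classes.values() if len(decisions) == 1)
    return consistent / total


def reduce_vertically(table, cond_idx, dec_idx):
    """Drop condition attributes whose removal leaves the dependency unchanged."""
    kept = list(cond_idx)
    full = dependency(table, kept, dec_idx)
    for i in list(kept):
        trial = [j for j in kept if j != i]
        if trial and dependency(table, trial, dec_idx) == full:
            kept = trial
    return kept


# Hypothetical customer tuples; "churn" plays the role of the class attribute.
ATTRS = ["city", "monthly_bill", "plan", "churn"]
rows = [
    {"city": "Regina", "monthly_bill": 30.0, "plan": "basic", "churn": "no"},
    {"city": "Saskatoon", "monthly_bill": 42.0, "plan": "basic", "churn": "no"},
    {"city": "Calgary", "monthly_bill": 120.0, "plan": "premium", "churn": "yes"},
    {"city": "Calgary", "monthly_bill": 75.0, "plan": "basic", "churn": "no"},
]

table = reduce_horizontally(rows, ATTRS)   # four tuples merge into three
kept = reduce_vertically(table, [0, 1, 2], 3)
print(table)
print([ATTRS[i] for i in kept])
```

In this toy data the first two customers generalize to the same tuple and are merged, and the backward elimination keeps only the attribute that still separates the classes. The result of such an elimination depends on the order in which attributes are tried, which is one reason a ranking measure such as the merit values mentioned in the abstract is useful.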

Cite this article

Hu, X., Cercone, N. Data Mining via Discretization, Generalization and Rough Set Feature Selection. Knowledge and Information Systems 1, 33–60 (1999). https://doi.org/10.1007/BF03325090
