Dimensionality optimization by heuristic greedy learning vs. genetic algorithms in knowledge discovery and data mining

https://doi.org/10.1016/S1088-467X(99)00019-0

Abstract

Dimensionality optimization involves optimizing the size of data sets along both dimensions: variable selection and observation selection. The ultimate objective of dimensionality optimization is to obtain an induced data space by reducing both dimensions in such a way that the reduced subset retains sufficient information. In most real-world applications, it is not known what the best subset is or what it should contain. Selecting an appropriate subset is extremely important for effectively mining large data sets, because that subset is the only material any data mining and knowledge discovery algorithm has for working with the data of interest reliably.

The statistics and artificial intelligence communities have provided good methods in this domain, but substantial improvements are still needed, especially for data mining applications. This paper introduces a heuristic methodology that integrates heuristic greedy search with the tree-structured SampleC4.5 algorithm to efficiently find the optimal subset of very large data sets along both dimensions simultaneously. A GA-based optimization approach is also proposed. Experimental results are presented that illustrate the effectiveness of our approaches in uncovering important underlying patterns, and indicate the potential of the proposed techniques to improve the optimization process while avoiding misleading results. The results of our experiments also show the robustness of our approaches and their complementary characteristics for knowledge discovery and data mining tasks.

Introduction

Contemporary corporations increasingly store large quantities of business data of all varieties. Data structures usually differ across industries, but within an industry, and especially within a corporation, the well-maintained structures of corporate data sets remain relatively stable, at least over a period of time, even though the amount of data keeps growing exponentially. Data mining applications undoubtedly have great potential for finding interesting, nontrivial, and potentially valuable patterns in these large data sets [9], [16]. However, working with these data sets is also very challenging because of their large, complex dimensionality, highly correlated variables, and redundant, very noisy data. It is naturally difficult to handle many variables and their relationships at once, and very noisy data can mislead any targeted research or application. These undesirable characteristics make the data sets difficult to mine with traditional approaches. As an initial but crucial step, it is necessary to optimize the studied data sets by reducing their enormous sizes, that is, by selecting an optimal subset that represents the original data sets well for data mining and knowledge discovery applications.

Dimensionality optimization of a data set involves optimizing its size along both dimensions simultaneously: variable or attribute selection and record or observation selection. The ultimate objective of dimensionality optimization is to obtain an induced data space, by reducing both dimensions of the original data set, in such a way that the reduced optimal subset retains sufficient information. In some situations, application-specific strategies can be developed based on domain knowledge in order to reduce the data set to a manageable size [8]. However, in most real-world applications such domain knowledge is not available a priori. It is also not known what the best subset is or what it should contain. Numerous algorithms and approaches for data reduction have previously been developed in the artificial intelligence [1], [16], [17], data mining [9], and statistics [2] communities. What data mining really needs is to efficiently find the optimal subset of very large data sets from both the variable and observation perspectives. We can then interpret the underlying data structures and reliably extract nontrivial, potentially valuable knowledge for commercial applications and for further in-depth research and study.
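To make this concrete, the induced data space is simply the submatrix obtained by keeping a selected subset of rows (observations) and columns (variables) of the original data matrix. A minimal sketch in Python follows, in which the data matrix and the chosen index sets are purely illustrative:

    import numpy as np

    # Illustrative data matrix: 1000 observations x 20 variables (synthetic).
    rng = np.random.default_rng(0)
    D = rng.normal(size=(1000, 20))

    # A candidate solution is a pair of index sets, one per dimension.
    row_idx = rng.choice(D.shape[0], size=200, replace=False)  # selected observations
    col_idx = np.array([0, 3, 7, 12])                          # selected variables

    # The induced data space: the original matrix reduced in both dimensions.
    D_reduced = D[np.ix_(row_idx, col_idx)]
    print(D_reduced.shape)                                     # (200, 4)

Dimensionality optimization is then the search for the pair of index sets that best preserves the information relevant to the mining task.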

We propose two effective learning techniques for dimensionality optimization in data mining applications. The heuristic-based learning methodology builds on the updated information-theoretic SampleC4.5, in which we study the optimization process while sampling a specified percentage of the entire data set. This iterative process uncovers the important underlying data structures in order to reach the optimal dimensionality for knowledge discovery. In the process we obtain the best splitting threshold, find the optimal subset of the data set and, more importantly, derive optimal or near-optimal solutions and recommendations for future research and commercial applications. We trace the classification performance of the induced decision trees over iterations, based on their behavior on unseen data, and make overall comparisons across the selected models along with other statistical criteria. Through the heuristic process we obtain robust structural solutions describing the patterns in the data set. For the exhaustive learning approach based on genetic algorithms, we develop the technique so that it provides good global solutions without prior domain knowledge, as well as valuable insights into the studied data sets.
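The following is a minimal sketch of such an iterative sampling-and-evaluation loop. SampleC4.5 itself is not publicly distributed, so a standard CART decision tree stands in for it here, and the sampling fraction, number of iterations, and hold-out evaluation are illustrative assumptions rather than the exact procedure used in the paper:

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    def heuristic_sampling_trace(X, y, sample_frac=0.2, n_iters=20, seed=0):
        """Repeatedly sample a fixed fraction of observations, induce a tree, and
        trace its classification performance on unseen data (X, y: NumPy arrays)."""
        rng = np.random.default_rng(seed)
        # Hold out unseen data once, so every induced tree is scored on the same set.
        X_train, X_unseen, y_train, y_unseen = train_test_split(
            X, y, test_size=0.3, random_state=seed)
        best_score, best_tree, best_idx, trace = -np.inf, None, None, []
        n = len(y_train)
        for _ in range(n_iters):
            idx = rng.choice(n, size=int(sample_frac * n), replace=False)
            tree = DecisionTreeClassifier(random_state=seed).fit(X_train[idx], y_train[idx])
            score = tree.score(X_unseen, y_unseen)   # accuracy on unseen data
            trace.append(score)
            if score > best_score:
                best_score, best_tree, best_idx = score, tree, idx
        return best_tree, best_idx, trace

The trace of scores across iterations gives the kind of performance curve used to judge when a sampled subset is large and representative enough.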

In this paper we provide insights into these two dimensionality optimization techniques, compare their advantages, and describe their strengths and complementary characteristics in the knowledge discovery domain. The paper is organized as follows: an overview of dimensionality reduction techniques is provided in the next section. Sections 3 and 4 present the proposed optimization learning methods. Experimental results on the SocioOlympic data set are given in Section 5, and Section 6 concludes the paper with a brief discussion.

Section snippets

Techniques for dimensionality reduction

Dimensionality optimization involves optimizing the size of data sets along both dimensions: variable or attribute selection and record or observation selection. The ultimate objective of dimensionality optimization is to obtain an induced data space, by reducing both dimensions of the original data set, in such a way that the reduced subset retains sufficient information. In most real-world applications, it is not known what the best subset is or what it should contain …

Heuristic greedy learning by sampleC4.5

The basic idea of our methodology is to implement a heuristic greedy search over all possible subset spaces using the updated SampleC4.5 algorithm. We attempt to reduce the number of observations in the data set along one dimension and to select the variable subset from the raw data along the other dimension simultaneously. The learning process proceeds until it finds an optimal or near-optimal subset with low dimensionality and high discriminating power. In order to achieve this, we choose the …
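A plausible form of this greedy search is sketched below: variables are added one at a time while observations are drawn as a fixed-fraction sample, with a CART decision tree standing in for SampleC4.5; the scoring function and stopping rule are assumptions made only for illustration:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import cross_val_score

    def greedy_variable_selection(X, y, sample_frac=0.3, seed=0):
        """Greedy forward search: repeatedly add the single variable that most
        improves decision-tree accuracy, estimated on a sampled subset of rows."""
        rng = np.random.default_rng(seed)
        n, m = X.shape
        rows = rng.choice(n, size=int(sample_frac * n), replace=False)  # observation subset
        Xs, ys = X[rows], y[rows]
        selected, best_overall = [], 0.0
        while True:
            best_var, best_score = None, best_overall
            for j in range(m):
                if j in selected:
                    continue
                cols = selected + [j]
                score = cross_val_score(DecisionTreeClassifier(random_state=seed),
                                        Xs[:, cols], ys, cv=3).mean()
                if score > best_score:
                    best_var, best_score = j, score
            if best_var is None:        # no remaining variable improves the estimate
                return selected, rows, best_overall
            selected.append(best_var)
            best_overall = best_score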

Dimensionality optimization by genetic algorithms

Genetic algorithms (GAs) [3], [6], [11], a form of adaptive search technique initially introduced by Holland [12], are being used in an ever wider variety of applications, from classical combinatorial optimization problems to classification problems in the data mining domain. Genetic algorithms are essentially domain-independent search techniques in that they exploit accumulated information about an initially unknown search space in order to bias subsequent search toward more promising …
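In the dimensionality optimization setting, a GA of this kind typically encodes a candidate variable subset as a bit string and uses the accuracy of a classifier induced on the reduced data as its fitness signal. The sketch below shows that basic loop; the encoding, the genetic operators, and the CART tree standing in for SampleC4.5 are illustrative assumptions, not the authors' exact design:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import cross_val_score

    def ga_variable_selection(X, y, pop_size=30, n_gens=40, p_mut=0.02, seed=0):
        """Simple generational GA: each chromosome is a bit mask over variables;
        fitness is cross-validated decision-tree accuracy on the selected columns."""
        rng = np.random.default_rng(seed)
        n_vars = X.shape[1]
        pop = rng.integers(0, 2, size=(pop_size, n_vars))

        def fitness(mask):
            if mask.sum() == 0:                   # an empty subset carries no information
                return 0.0
            tree = DecisionTreeClassifier(random_state=seed)
            return cross_val_score(tree, X[:, mask.astype(bool)], y, cv=3).mean()

        for _ in range(n_gens):
            fit = np.array([fitness(ind) for ind in pop])
            # Binary tournament selection of parents.
            parents = pop[[max(rng.choice(pop_size, 2), key=lambda i: fit[i])
                           for _ in range(pop_size)]]
            # One-point crossover between consecutive parents.
            children = parents.copy()
            for i in range(0, pop_size - 1, 2):
                cut = rng.integers(1, n_vars)
                children[i, cut:], children[i + 1, cut:] = \
                    parents[i + 1, cut:].copy(), parents[i, cut:].copy()
            # Bit-flip mutation.
            flips = rng.random(children.shape) < p_mut
            children[flips] = 1 - children[flips]
            pop = children
        fit = np.array([fitness(ind) for ind in pop])
        best = pop[int(fit.argmax())]
        return best.astype(bool), fit.max()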

Experimental results

In this section the SocioOlympic data set used in the application of both proposed learning methods is described in more detail, followed by a discussion of the experimental results. We implemented both proposed algorithms on a Sun SPARC 20 running Solaris 2.5.1.

Summary and conclusions

This paper introduces a heuristic methodology and a GA-based optimization approach. It is well known that GA-based optimization approaches can readily avoid local optima, but at the expense of increased computational effort. With SampleC4.5 embedded in the proposed GA-based algorithm, we avoid loading the very large data sets all at once by sampling incrementally. The evaluation process of genetic algorithms is time-consuming, but obviously there is a trade-off between the best solutions …

References (24)

  • J. Bala, J. Huang, H. Vafaie, K. DeJong, H. Wechsler, Heuristic learning using genetic algorithms and decision trees...
  • M. Berenson, D. Levine, M. Goldstein, Intermediate Statistical Methods and Applications, Prentice Hall, Englewood...
  • A. Bethke, Genetic Algorithms as Function Optimizers, Ph.D. Dissertation, Department of Computer and Communication...
  • J. Chattratichat et al., Large scale data mining: challenges and responses, in: Proceedings of the Third International...
  • E. Condon, Predicting the Success of Nations in the Summer Olympics Using Neural Networks, Master’s thesis, University...
  • K. DeJong, Adaptive system design: a genetic approach, IEEE Transactions on Systems, Man, and Cybernetics (1980)
  • K. DeJong, Learning with genetic algorithms: an overview, Machine Learning, Vol. 3, Kluwer Academic Publishers,...
  • B. Dom, W. Niblack, J. Sheinvald, Variable selection with stochastic complexity, in: Proceedings of the IEEE Conference...
  • U. Fayyad et al., Advances in Knowledge Discovery and Data Mining, MIT Press, Cambridge, MA,...
  • D. Goldberg, Genetic Algorithms in Search, Optimization, and Machine Learning, Addison-Wesley, Reading, MA,...
  • J. Grefenstette et al., Genesis and OOGA: Two Genetic Algorithm Systems, TSP, Melrose, MA,...
  • J.H. Holland, Adaptation in Natural and Artificial Systems, University of Michigan Press, Ann Arbor,...