Decision tree induction using a fast splitting attribute selection for large datasets

https://doi.org/10.1016/j.eswa.2011.05.087

Abstract

Several algorithms have been proposed in the literature for building decision trees (DTs) from large datasets; however, almost all of them have memory restrictions because they need to keep the whole training set, or a large part of it, in main memory. The algorithms that avoid this restriction by choosing a subset of the training set either need extra time for this selection or depend on parameters that can be very difficult to determine. In this paper, we introduce a new algorithm that builds decision trees using a fast splitting attribute selection (DTFS) for large datasets. The proposed algorithm builds a DT without storing the whole training set in main memory; it has only one parameter, and its behavior is very stable with respect to this parameter. Experimental results on both real and synthetic datasets show that our algorithm is faster than three of the most recent algorithms for building decision trees for large datasets, while obtaining competitive accuracy.

Highlights

► This paper presents DTFS, a new algorithm for building decision trees from large datasets.
► DTFS is faster than previous algorithms for building decision trees from large datasets.
► DTFS processes the instances incrementally and does not store the whole training set in memory.
► If the number of attributes increases, DTFS has better behavior than previous algorithms.

Introduction

Classification is an important task in data mining (Tan, Steinbach, & Kumar, 2006). Currently, there are many classification problems where large training datasets are available; therefore, there is great interest in developing classifiers that can handle this kind of dataset in a reasonable time.

Decision trees (Quinlan, 1986, Quinlan, 1993) are commonly used for solving classification problems in Machine Learning and Pattern Recognition. A DT is formed by internal nodes, leaves, and edges, and it can be induced from a training set of instances, each represented by a tuple of attribute values and a class label. Internal nodes have a splitting attribute, and each node has one or more children (edges). Each child has an associated value of the splitting attribute, and these values determine the path followed during a tree traversal. Each leaf has an associated class label. In order to classify a new instance, the tree is traversed from the root to a leaf; when the new instance reaches a leaf, it is classified according to the class label associated with that leaf.
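To make the traversal concrete, the following is a minimal sketch in Python (not code from the paper; the node layout, attribute names, and values below are purely illustrative): internal nodes name a splitting attribute and map attribute values to children, while leaves carry a class label.

    # Illustrative hand-built tree: internal nodes hold an "attribute" and a
    # "children" mapping; leaves hold a "label".
    tree = {
        "attribute": "outlook",
        "children": {
            "sunny":    {"attribute": "humidity",
                         "children": {"high": {"label": "no"},
                                      "normal": {"label": "yes"}}},
            "overcast": {"label": "yes"},
            "rainy":    {"label": "no"},
        },
    }

    def classify(node, instance):
        # Follow, at each internal node, the edge matching the instance's value
        # for the splitting attribute, until a leaf is reached.
        while "label" not in node:
            value = instance[node["attribute"]]
            node = node["children"][value]
        return node["label"]

    print(classify(tree, {"outlook": "sunny", "humidity": "normal"}))  # prints "yes"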

Several algorithms have been developed for building DTs from large datasets (Alsabti et al., 1998, Domingos and Hulten, 2000, Gehrke et al., 1998, Gehrke et al., 2000, Gehrke et al., 1999, Mehta et al., 1996, Shafer et al., 1996, Yang et al., 2008, Yoon et al., 1999). However, almost all of them have memory restrictions, because they have to keep the whole training set in main memory, or they use a representation of the attributes that requires even more space than the whole training set. On the other hand, in the algorithms without such restrictions, the construction of the DT is based only on a small subset of the training set; however, obtaining this subset requires additional time, which can be too expensive for large training sets, or the algorithms rely on several parameters, which can be very difficult to determine.

Having identified these drawbacks, this work introduces a new algorithm for building DTs that overcomes them. Our algorithm (DTFS) follows two main ideas: it uses a fast splitting attribute selection for expanding nodes (deleting the instances stored in a node after its expansion), and it processes all the instances of the training set in an incremental way, so it is not necessary to store the whole training set in main memory.

In the literature, some new techniques for selecting splitting attributes have been proposed (Berzal et al., 2004, Chandra and Paul Varghese, 2009, Ouyang et al., 2009); however, these techniques are not designed for handling large datasets, because some of them have to evaluate many candidate splits in order to choose the best attribute, others use discretization methods to deal with numerical attributes, and others use expensive techniques to expand nodes. On the other hand, several algorithms for building DTs in an incremental way have been proposed, such as ID5R (Utgoff, 1989), PT2 (Utgoff & Brodley, 1990), ITI (Utgoff, 1994), StreamTree (Jin & Agrawal, 2003) and UFFT (Gama & Medas, 2005); however, these algorithms cannot handle large datasets either, because they need to keep the whole training set in main memory for building the DT.

In this paper, we propose an algorithm that processes the training instances one by one; each training instance traverses the DT until a leaf is reached, where the instance is stored. In our algorithm, when a leaf has stored a predefined number of instances (a parameter of the algorithm), it is expanded by choosing a splitting attribute from the instances in the leaf and creating an edge for each class of instances present in the leaf. After expanding a leaf, the instances stored in that leaf are deleted. Experimental results over several large datasets show that our algorithm is faster than three of the most recent algorithms for building DTs for large datasets, obtaining a competitive accuracy.
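The following rough sketch illustrates this incremental construction under our own assumptions; it is not the paper's implementation. In particular, the splitting criterion shown (spread of per-class means) and the nearest-value routing for numeric attributes are stand-ins for the paper's fast splitting attribute selection, and the Node structure is illustrative.

    from collections import defaultdict

    class Node:
        # Illustrative node structure; the paper's actual data structures may differ.
        def __init__(self, majority_class=None):
            self.splitting_attribute = None    # attribute index once the node is expanded
            self.children = []                 # list of (edge value, child Node)
            self.stored = []                   # (instance, label) pairs kept while a leaf
            self.majority_class = majority_class

        def is_leaf(self):
            return self.splitting_attribute is None

    def insert(root, instance, label, s):
        # Route one training instance to a leaf and store it there; expand the
        # leaf once it holds s instances of more than one class.
        leaf = _route(root, instance)
        leaf.stored.append((instance, label))
        labels = [l for _, l in leaf.stored]
        leaf.majority_class = max(set(labels), key=labels.count)
        if len(leaf.stored) >= s and len(set(labels)) > 1:
            _expand(leaf)

    def _route(node, instance):
        # Follow, at each internal node, the edge whose value is closest to the
        # instance's value for the splitting attribute (nearest-value routing is
        # our assumption for numeric attributes, not necessarily the paper's rule).
        while not node.is_leaf():
            a = node.splitting_attribute
            _, node = min(node.children, key=lambda edge: abs(edge[0] - instance[a]))
        return node

    def _expand(leaf):
        # Group the stored instances by class.
        by_class = defaultdict(list)
        for inst, lbl in leaf.stored:
            by_class[lbl].append(inst)
        # Stand-in splitting criterion (NOT the paper's fast selection): choose
        # the attribute whose per-class mean values are most spread out.
        n_attrs = len(leaf.stored[0][0])
        def spread(a):
            means = [sum(inst[a] for inst in group) / len(group)
                     for group in by_class.values()]
            return max(means) - min(means)
        leaf.splitting_attribute = max(range(n_attrs), key=spread)
        # One edge per class present in the leaf, keyed by that class's mean
        # value of the chosen attribute; the stored instances are then discarded.
        a = leaf.splitting_attribute
        for lbl, group in by_class.items():
            mean_val = sum(inst[a] for inst in group) / len(group)
            leaf.children.append((mean_val, Node(majority_class=lbl)))
        leaf.stored = []

Under these assumptions, a tree would be grown by calling insert(root, x, y, s) once per training pair, so only the at most s instances held in the leaves are ever kept in memory; classification then follows a root-to-leaf traversal like the one sketched earlier.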

The rest of the paper is organized as follows. Section 2 gives an overview of the works related to DT induction for large datasets. Section 3 introduces the DTFS algorithm, which allows building DTs for large datasets. Section 4 provides experimental results and a comparison against other algorithms for DT induction for large datasets, on both real and synthetic datasets. Finally, Section 5 gives our conclusions and some directions for future work.

Section snippets

Related work

In this section, several algorithms that have been proposed to build DTs for large datasets are described.

Mehta et al. (1996) presented SLIQ (Supervised Learning In Quest), an algorithm for building DTs for large datasets. This algorithm uses a list structure for each attribute; these lists can be stored on disk in order to avoid keeping the whole training set in main memory. However, SLIQ uses an extra list that must be stored in main memory; this list contains the class of each
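A brief sketch of this layout, under our simplified reading of Mehta et al. (1996) and not code from that paper: one sorted list per attribute holding (attribute value, record id) pairs, which SLIQ can keep on disk, plus a single class list indexed by record id that must remain in main memory.

    def build_sliq_lists(instances, labels):
        # instances: list of equal-length attribute tuples; labels: class per instance.
        n_attrs = len(instances[0])
        # One attribute list per attribute, sorted by value (kept on disk in SLIQ).
        attribute_lists = [
            sorted((inst[a], rid) for rid, inst in enumerate(instances))
            for a in range(n_attrs)
        ]
        # The memory-resident class list: record id -> class label.
        class_list = list(labels)
        return attribute_lists, class_list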

Proposed algorithm

In this work, we propose a new algorithm for building DTs for large datasets (DTFS) that overcomes the shortcomings of the related algorithms. In order to avoid storing the whole training set in main memory, DTFS builds DTs in an incremental way. Thus, the training instances are processed one by one, each traversing the DT until it reaches a leaf, where the instance is stored. Besides, to avoid storing all the training instances in the tree, a leaf stores at most s instances

Experimental results

The experiments were conducted in four directions. First, in Section 4.1, we analyze the behavior of our algorithm when the parameter s (the maximum number of instances in a leaf) varies. In Section 4.2, we analyze the behavior of DTFS when the number of attributes in the training set varies, since we want to show that our algorithm performs a fast selection of the splitting attribute regardless of the number of attributes in the dataset. Additionally, in Section 4.3 we present a comparison among DTFS and

Conclusions and future work

In this work, we have introduced a decision tree induction algorithm, called DTFS, which uses a fast splitting attribute selection for expanding nodes. Our algorithm does not require storing the whole training set in memory and processes all the instances in the training set. The key insight is to process the instances one by one, updating the DT with each one (processing the data in an incremental way), and to use a small number of instances for expanding a leaf (discarding them after the

Acknowledgement

Funding for the SDSS and SDSS-II has been provided by the Alfred P. Sloan Foundation, the Participating Institutions, the National Science Foundation, the US Department of Energy, the National Aeronautics and Space Administration, the Japanese Monbukagakusho, the Max Planck Society, and the Higher Education Funding Council for England. The SDSS Web Site is http://www.sdss.org/.

The SDSS is managed by the Astrophysical Research Consortium for the Participating Institutions. The Participating

References (26)

  • F. Berzal et al. Building multi-way decision trees with numerical attributes. Information Sciences (2004).
  • B. Chandra et al. Moving towards efficient decision tree construction. Information Sciences (2009).
  • J. Ouyang et al. Induction of multiclass multifeature split decision trees from distributed data. Pattern Recognition (2009).
  • Alsabti, K., Ranka, S., & Singh, V. (1998). CLOUDS: A decision tree classifier for large datasets. In Proceedings of...
  • J. Demsar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research (2006).
  • P. Domingos et al. Mining high-speed data streams.
  • Freund, Y., & Mason, L. (1999). The alternating decision tree learning algorithm. In 16th international conference on...
  • J. Gama et al. Learning decision trees from dynamic data streams. Journal of Universal Computer Science (2005).
  • Gehrke, J., Ramakrishnan, R., & Ganti, V. (1998). Rainforest – A framework for fast decision tree classification of large...
  • J. Gehrke et al. BOAT – Optimistic decision tree construction. ACM SIGMOD Record (1999).
  • J. Gehrke et al. Rainforest – A framework for fast decision tree construction of large datasets. Data Mining and Knowledge Discovery (2000).
  • Jin, R., & Agrawal, G. (2003). Efficient decision tree construction on streaming data. In Proceedings of ninth ACM SIGKDD...
  • Li, Z., Wang, T., Wang, R., Yan, Y., & Chen, H. (2007). A new fuzzy decision tree classification method for mining...