Decision tree induction using a fast splitting attribute selection for large datasets
Highlights
► This paper presents DTFS, a new algorithm for building decision trees from large datasets.
► DTFS is faster than previous algorithms for building decision trees from large datasets.
► DTFS processes the instances incrementally and does not store the whole training set in memory.
► As the number of attributes increases, DTFS behaves better than previous algorithms.
Introduction
Classification is an important task in data mining (Tan, Steinbach, & Kumar, 2006). Currently, many classification problems come with large training datasets, so there is great interest in developing classifiers that can handle such datasets in a reasonable time.
Decision trees (DTs) (Quinlan, 1986, Quinlan, 1993) are commonly used for solving classification problems in Machine Learning and Pattern Recognition. A DT is formed by internal nodes, leaves, and edges, and it can be induced from a training set of instances, each represented by a tuple of attribute values and a class label. Each internal node has a splitting attribute and one or more children (edges); each child is associated with a value of the splitting attribute, and these values determine the path followed during a tree traversal. Each leaf has an associated class label. To classify a new instance, the tree is traversed from the root to a leaf; when the new instance arrives at a leaf, it is classified according to the class label associated with that leaf.
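As a concrete illustration of this traversal, the sketch below classifies an instance with a minimal tree over categorical attributes; the `Node`/`Leaf` classes and the example attribute values are ours, not from the paper.

```python
# Minimal sketch of DT classification as described above; the Node/Leaf
# classes and the example tree are illustrative, not from the paper.

class Leaf:
    def __init__(self, label):
        self.label = label  # class label assigned to instances reaching this leaf

class Node:
    def __init__(self, attribute, children):
        self.attribute = attribute  # splitting attribute
        self.children = children    # attribute value -> subtree (one per edge)

def classify(tree, instance):
    """Traverse from the root to a leaf and return that leaf's class label."""
    node = tree
    while isinstance(node, Node):
        node = node.children[instance[node.attribute]]
    return node.label

# A one-level tree splitting on a hypothetical attribute "outlook".
tree = Node("outlook", {"sunny": Leaf("no"), "rain": Leaf("yes")})
print(classify(tree, {"outlook": "sunny"}))  # -> no
```

The traversal visits one internal node per level, so classification cost depends only on the depth of the tree, not on the size of the training set.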
Several algorithms have been developed for building DTs from large datasets (Alsabti et al., 1998, Domingos and Hulten, 2000, Gehrke et al., 1998, Gehrke et al., 2000, Gehrke et al., 1999, Mehta et al., 1996, Shafer et al., 1996, Yang et al., 2008, Yoon et al., 1999). However, almost all of them have spatial restrictions: they must keep the whole training set in main memory, or they use a representation of the attributes that requires more space than the whole training set. On the other hand, among the algorithms without spatial restrictions, some build the DT from only a small subset of the data, but obtaining this subset requires additional time, which can be too expensive for large training sets; others use several parameters that can be very difficult to determine.
Having identified these drawbacks, this work introduces a new algorithm for building DTs that solves these problems. Our algorithm (DTFS) follows two main ideas: it uses a fast splitting attribute selection for expanding nodes (deleting the instances stored in a node after its expansion), and it processes all the instances of the training set incrementally, so it is not necessary to store the whole training set in main memory.
In the literature, some new techniques to select splitting attributes have been proposed (Berzal et al., 2004, Chandra and Paul Varghese, 2009, Ouyang et al., 2009); however, these techniques are not designed for handling large datasets, because some of them must evaluate many candidate splits to choose the best attribute, others use discretization methods to deal with numerical attributes, and still others use expensive techniques to expand nodes. On the other hand, several algorithms for building DTs incrementally have been proposed, such as ID5R (Utgoff, 1989), PT2 (Utgoff & Brodley, 1990), ITI (Utgoff, 1994), StreamTree (Jin & Agrawal, 2003) and UFFT (Gama & Medas, 2005); however, these algorithms cannot handle large datasets either, because they need to keep the whole training set in main memory for building the DT.
In this paper, we propose an algorithm that processes the training instances one by one: each training instance traverses the DT until a leaf is reached, where the instance is stored. When a leaf has stored a predefined number of instances (a parameter of the algorithm), it is expanded by choosing a splitting attribute, using the instances in the leaf, and creating an edge for each class of instances in the leaf. After expanding a leaf, the instances stored in that leaf are deleted. Experimental results over several large datasets show that our algorithm is faster than three of the most recent algorithms for building DTs for large datasets, while obtaining competitive accuracy.
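The incremental scheme described above can be sketched as follows. This is a hedged approximation: the split chooser is a placeholder (DTFS defines its own fast splitting attribute selection), edges here follow attribute values rather than the paper's class-based edge creation, and all names (`S`, `insert`, `choose_split`) are ours.

```python
# Hedged sketch of the incremental scheme described above: each instance is
# routed to a leaf, and once a leaf holds s instances it is expanded and its
# stored instances are discarded. The split chooser below is a placeholder,
# and edges follow attribute values for simplicity; DTFS's actual splitting
# criterion and edge-creation scheme are defined in the paper.
from collections import Counter

S = 4  # the algorithm's parameter s: maximum instances stored in a leaf

class Leaf:
    def __init__(self):
        self.instances = []  # (features, label) pairs currently stored
        self.label = None    # majority class among instances seen here

class Node:
    def __init__(self, attribute):
        self.attribute = attribute
        self.children = {}   # attribute value -> child subtree

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def choose_split(instances):
    # Placeholder criterion: attribute with the most distinct values.
    attrs = instances[0][0].keys()
    return max(attrs, key=lambda a: len({x[a] for x, _ in instances}))

def insert(tree, x, y):
    """Route (x, y) to a leaf; expand the leaf if it reaches S instances.
    Returns the (possibly new) root."""
    node, parent, edge = tree, None, None
    while isinstance(node, Node):
        parent, edge = node, x[node.attribute]
        node = node.children.setdefault(edge, Leaf())
    node.instances.append((x, y))
    node.label = majority([l for _, l in node.instances])
    if len(node.instances) >= S:           # leaf is full: expand it
        new = Node(choose_split(node.instances))
        for xi, yi in node.instances:      # distribute instances to children
            child = new.children.setdefault(xi[new.attribute], Leaf())
            child.instances.append((xi, yi))
            child.label = majority([l for _, l in child.instances])
        for child in new.children.values():
            child.instances.clear()        # discard instances after expansion
        if parent is None:
            return new                     # the root itself was expanded
        parent.children[edge] = new
    return tree

tree = Leaf()
stream = [({"a": 0, "b": 0}, "p"), ({"a": 1, "b": 0}, "q"),
          ({"a": 0, "b": 1}, "p"), ({"a": 1, "b": 1}, "q")]
for x, y in stream:
    tree = insert(tree, x, y)
```

Note that at any moment the memory used is bounded by the tree itself plus at most s instances per leaf, which is the property that lets the scheme avoid loading the whole training set.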
The rest of the paper is organized as follows. Section 2 gives an overview of the works related to DT induction for large datasets. Section 3 introduces the DTFS algorithm, which allows building DTs for large datasets. Section 4 provides experimental results and a comparison against other algorithms for DT induction for large datasets, on both real and synthetic datasets. Finally, Section 5 gives our conclusions and some directions for future work.
Section snippets
Related work
In this section, several algorithms that have been proposed to build DTs for large datasets are described.
Mehta et al. (1996) presented SLIQ (Supervised Learning In Quest), an algorithm for building DTs for large datasets. SLIQ uses a list structure for each attribute; these lists can be stored on disk, avoiding the need to keep the whole training set in main memory. However, SLIQ uses an extra list that must be stored in main memory; this list contains the class of each
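The attribute-list layout used by SLIQ can be illustrated roughly as follows; this is a sketch based on the description above, and the records and attribute names are invented for the example.

```python
# Rough illustration of SLIQ's data layout, based on the description above;
# the records and attribute names are invented for the example.

records = [
    {"age": 30, "salary": 65, "class": "G"},
    {"age": 23, "salary": 15, "class": "B"},
    {"age": 40, "salary": 75, "class": "G"},
]

# One attribute list per attribute: (value, record id) pairs sorted by value.
# In SLIQ these lists can reside on disk rather than in main memory.
attribute_lists = {
    attr: sorted((r[attr], rid) for rid, r in enumerate(records))
    for attr in ("age", "salary")
}

# The class list is the extra structure SLIQ must keep in main memory:
# it maps each record id to its class label (and, in SLIQ, its current leaf).
class_list = [r["class"] for r in records]

print(attribute_lists["age"])  # -> [(23, 1), (30, 0), (40, 2)]
```

Because the class list grows linearly with the number of training instances, it is this in-memory structure that limits SLIQ's scalability.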
Proposed algorithm
In this work, we propose a new algorithm for building DTs for large datasets (DTFS) that overcomes the shortcomings of the related algorithms. To avoid storing the whole training set in main memory, DTFS builds DTs incrementally: the training instances are processed one by one, each traversing the DT until it reaches a leaf, where the instance is stored. Besides, to avoid storing all the training instances in the tree, a leaf stores at most s instances
Experimental results
The experiments were conducted in four directions. First, in Section 4.1 we analyze the behavior of our algorithm when the parameter s (the maximum number of instances in a leaf) varies. In Section 4.2 we analyze the behavior of DTFS when the number of attributes in the training set varies, since we want to show that our algorithm performs a fast selection of the splitting attribute regardless of the number of attributes in the dataset. Additionally, in Section 4.3 we present a comparison among DTFS and
Conclusions and future work
In this work, we have introduced a decision tree induction algorithm, called DTFS, which uses a fast splitting attribute selection for expanding nodes. Our algorithm does not require storing the whole training set in memory and processes all the instances in the training set. The key insight is to process the instances one by one, updating the DT with each of them (processing the data incrementally), and to use a small number of instances for expanding a leaf (discarding them after the
Acknowledgement
Funding for the SDSS and SDSS-II has been provided by the Alfred P. Sloan Foundation, the Participating Institutions, the National Science Foundation, the US Department of Energy, the National Aeronautics and Space Administration, the Japanese Monbukagakusho, the Max Planck Society, and the Higher Education Funding Council for England. The SDSS Web Site is http://www.sdss.org/.
The SDSS is managed by the Astrophysical Research Consortium for the Participating Institutions. The Participating
References (26)
- Alsabti, K., Ranka, S., & Singh, V. (1998). CLOUDS: A decision tree classifier for large datasets. In Proceedings of...
- Berzal, F., et al. (2004). Building multi-way decision trees with numerical attributes. Information Sciences.
- Chandra, B., & Paul Varghese, P. (2009). Moving towards efficient decision tree construction. Information Sciences.
- Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research.
- Domingos, P., & Hulten, G. (2000). Mining high-speed data streams.
- Freund, Y., & Mason, L. (1999). The alternating decision tree learning algorithm. In 16th international conference on...
- Gama, J., & Medas, P. (2005). Learning decision trees from dynamic data streams. Journal of Universal Computer Science.
- Gehrke, J., Ramakrishnan, R., & Ganti, V. (1998). RainForest – A framework for fast decision tree classification of large...
- Gehrke, J., et al. (1999). BOAT – Optimistic decision tree construction. ACM SIGMOD Record.
- Gehrke, J., et al. (2000). RainForest – A framework for fast decision tree construction of large datasets. Data Mining and Knowledge Discovery.
- Ouyang, J., et al. (2009). Induction of multiclass multifeature split decision trees from distributed data. Pattern Recognition.