Neural Networks

Volume 124, April 2020, Pages 20-38

Adaptive neural tree exploiting expert nodes to classify high-dimensional data

https://doi.org/10.1016/j.neunet.2019.12.029

Abstract

Classification of high-dimensional data suffers from the curse of dimensionality and over-fitting. The neural tree is a powerful method that combines local feature selection with recursive partitioning to address these problems, but it produces deep trees when classifying high-dimensional data. On the other hand, if shallower trees are used, classification accuracy decreases or over-fitting increases. This paper introduces a novel Neural Tree exploiting Expert Nodes (NTEN) to classify high-dimensional data. It is based on a decision tree structure whose internal nodes are expert nodes performing multi-dimensional splitting. Each expert node has three decision-making abilities. First, it can select the most eligible neural network with respect to the data complexity. Second, it evaluates the over-fitting. Third, it can cluster the features to jointly minimize redundancy and overlap. To this aim, metaheuristic optimization algorithms including GA, NSGA-II, PSO and ACO are applied. Based on these concepts, an expert node splits a class when over-fitting is low and clusters the features when over-fitting is high. Some theoretical results on NTEN are derived, and experiments on 35 standard datasets show that NTEN achieves good classification results and reduces tree depth without over-fitting or degrading accuracy.

Introduction

High-dimensional data classification poses significant challenges for basic classifiers, including the curse of dimensionality and over-fitting, which make them impractical and insufficient for large sets of real problems (Shi, Liu, Qi, & Wang, 2018). One promising solution to these challenges is dimension reduction (Shi et al., 2018). In addition, data partitioning can help to achieve simpler classifiers in each partition (Castro, Georgiopoulos, Demara, & Gonzalez, 2005). Neural trees, by combining local feature selection with recursive partitioning, can be helpful in this case. Moreover, by borrowing the fast training phase and good generalization ability of decision trees and the strong classification ability of neural networks, neural trees lead to powerful classifiers even on high-dimensional data (Federer & Zylberberg, 2018). Different neural trees have achieved good results on different problems, such as the perceptron tree (Utgoff, 1989), the neural tree with linear discriminant (Rani, Kumar, Micheloni, & Foresti, 2013), the decision tree with bounded error (Saettler, Laber, & Pereira, 2017), the balanced neural tree (Micheloni, Rani, Kumar, & Foresti, 2012), the neural tree with multi-dimensional split (Maji, 2008), the generalized neural tree (Foresti & Micheloni, 2002), the omnivariate neural tree (Yildiz & Alpaydin, 2001) and neural trees with knowledge transferring (Abpeykar & Ghatee, 2018). In addition, some neural trees have achieved significant results in dealing with high-dimensional data, including CART (Castro et al., 2005), SAINT (Federer & Zylberberg, 2018), the adaptive high-order neural tree (Foresti & Dolso, 2004), the decision forest of RBF networks (Abpeykar, Ghatee, & Zare, 2019) and neural trees with P2P and SC knowledge transferring (Abpeykar & Ghatee, 2019). However, there remain several dilemmas that existing neural trees do not consider jointly:

  • 1.

    Accurate classification of a high-dimensional feature space leads to deeper trees; thus, achieving shallower neural trees requires more complex computations at each node (Foresti and Micheloni, 2002, Micheloni et al., 2012, Rani et al., 2013).

  • 2.

    A multi-dimensional split is considered in some neural tree models (Foresti and Dolso, 2004, Maji, 2008, Micheloni et al., 2012, Ojha et al., 2017), but none of them analyzes the data complexity before splitting, which would lead to more accurate classification at each neural tree node.

  • 3.

    The data partition at each neural tree node is considered as a sub-problem (Utgoff, 1989). Each sub-problem needs an eligible neural network, which leads to accurate and shallower trees (Foresti and Dolso, 2004, Micheloni et al., 2012). There are some studies on hybrid neural trees with different MLPs, but there is no neural tree with expert nodes. By applying some if-then rules, these nodes can be extended to select an eligible neural network with respect to the data complexity (a minimal sketch follows this list).

  • 4.

    Because of the large amount of redundancy in high-dimensional feature spaces, neural trees face over-fitting. Feature clustering by expert nodes can be used to reduce this problem.
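As referenced in dilemma 3, the if-then model selection performed by an expert node can be illustrated as follows. This is a minimal sketch, assuming a scalar data-complexity score in [0, 1]; the thresholds and candidate networks are hypothetical, not the paper's actual knowledge base.

```python
# Minimal sketch of rule-based model selection in an expert node.
# The complexity thresholds (0.3, 0.7) and candidate networks are
# illustrative assumptions, not the paper's actual knowledge base.
from sklearn.linear_model import Perceptron
from sklearn.neural_network import MLPClassifier

def select_eligible_network(complexity: float):
    """Pick a classifier whose capacity matches the estimated
    data complexity of the local training set (LTS)."""
    if complexity < 0.3:    # nearly separable: a linear model suffices
        return Perceptron()
    if complexity < 0.7:    # moderate class overlap: a small MLP
        return MLPClassifier(hidden_layer_sizes=(16,))
    return MLPClassifier(hidden_layer_sizes=(64, 32))  # strong overlap
```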

NTEN is characterized by three main novelties, which address the mentioned dilemmas: (1) a multi-dimensional split of the feature space, (2) the selection of the most appropriate neural network for each local training set and (3) feature clustering in expert nodes in case of over-fitting, by applying metaheuristic optimization algorithms. For the third novelty, different algorithms are applied for feature clustering: the Genetic Algorithm (GA) (Silva Filho, Souza, & Prudêncio, 2016), the Non-dominated Sorting Genetic Algorithm (NSGA-II) (Deb, Pratap, Agarwal, & Meyarivan, 2002), Particle Swarm Optimization (PSO) (Silva Filho et al., 2016) and Ant Colony Optimization (ACO) (Silva Filho et al., 2016). In traditional decision trees, feature selection at each node is done on the basis of entropy (Altınçay, 2007), the Gini index (Hady, Schwenker, & Palm, 2010) or misclassification error (Chen & Hung, 2009). A data split based on these methods leads to deep trees on data with high-dimensional feature spaces and needs huge computation. In these cases, a multi-dimensional split can be helpful (Foresti and Dolso, 2004, Foresti and Micheloni, 2002, Rani et al., 2013). NTEN applies a multi-dimensional split in its expert nodes and chooses features with the least volume of overlap region (VOR) in such a way that it leads to shallower trees without degrading classification accuracy. By selecting features with a low volume of overlap region between classes, each neural network trains on a feature space with low complexity; this allows computing good boundaries between classes and splitting the samples more confidently. In addition, it leads to child nodes with less overlap volume. The assignment of the features to the most appropriate neural network based on the data complexity of the local training set (LTS) enhances the classification performance. Finally, when NTEN faces over-fitting, the expert node clusters the features as a remedy. Since redundancy can cause over-fitting in high-dimensional data (Cong et al., 2017), it is also minimized within the clusters.
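To make the clustering objective concrete, the following is a hedged sketch in the spirit of Eqs. (2) and (3), which are not reproduced in this excerpt: within-cluster redundancy is approximated here by the mean absolute pairwise correlation, and the VOR term by the normalized overlap of per-class feature ranges; both surrogates are assumptions. A GA, NSGA-II, PSO or ACO would then search over cluster-assignment vectors to minimize this fitness.

```python
# Hedged sketch of a K-way feature-clustering fitness: within-cluster
# redundancy (mean absolute pairwise correlation, an assumed surrogate)
# plus a VOR term (normalized overlap of per-class feature ranges,
# also an assumed surrogate). Lower is better.
import numpy as np

def vor(X, y):
    """Average, over features, of the overlap of class value ranges."""
    classes = np.unique(y)
    overlaps = []
    for j in range(X.shape[1]):
        lo = max(X[y == c, j].min() for c in classes)
        hi = min(X[y == c, j].max() for c in classes)
        span = X[:, j].max() - X[:, j].min() + 1e-12
        overlaps.append(max(0.0, hi - lo) / span)
    return float(np.mean(overlaps))

def fitness(assignment, X, y, K):
    """assignment[j] = cluster index of feature j; lower is better."""
    total = 0.0
    for k in range(K):
        idx = np.where(assignment == k)[0]
        if len(idx) < 2:
            continue
        corr = np.corrcoef(X[:, idx], rowvar=False)
        redundancy = np.abs(corr[np.triu_indices(len(idx), 1)]).mean()
        total += redundancy + vor(X[:, idx], y)
    return total
```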

This paper is organized as follows. Section 2 describes the related works. In Section 3, the NTEN model is presented. Section 4 introduces expert nodes of the NTEN. Theoretical aspects are mentioned in Section 5. Experimental results on high dimensional data are presented in Section 6. Section 7 concludes the paper.

Section snippets

Related works

A neural tree is a decision tree whose non-terminal nodes contain a neural network that recursively trains on and partitions the feature space of the LTS and splits it for the next child nodes (Foresti and Micheloni, 2002, Foresti and Pieroni, 1998, Micheloni et al., 2012). Different neural trees have been used to solve many types of problems. The first neural tree, called the perceptron tree (Utgoff, 1989), checks whether or not a simple perceptron can separate the LTS. When it cannot, it is replaced

Neural Tree with Expert Nodes (NTEN)

NTEN has a tree structure whose internal nodes contain expert systems that can decide which kind of neural network is eligible to classify the sub-problem associated with the LTS. Then, the eligible classifier trains on the samples. When over-fitting is low, the samples that are classified correctly with the class label of best accuracy are assigned to the left child node. The others are sent to the right child, which is a new expert node for the next classification phases. If the over-fitting of
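The node routine just described can be sketched as follows. This is a hedged illustration: the over-fitting estimate (the train/validation accuracy gap), the 0.1 threshold, and the helper functions select_network and cluster_features (stand-ins for the expert node's subsystems) are all assumptions of this sketch, not the paper's exact procedure.

```python
# Hedged sketch of one expert-node step: train an eligible classifier;
# if over-fitting is high, cluster features; otherwise split the most
# accurately classified class off to the left child.
import numpy as np
from sklearn.model_selection import train_test_split

def expert_node_step(X, y, select_network, cluster_features):
    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=0)
    clf = select_network(X_tr, y_tr)   # hypothetical first subsystem
    clf.fit(X_tr, y_tr)
    gap = clf.score(X_tr, y_tr) - clf.score(X_val, y_val)  # over-fitting proxy
    if gap > 0.1:                      # high over-fitting: cluster features
        return ("cluster", cluster_features(X, y))
    pred = clf.predict(X)              # low over-fitting: split the samples
    acc = {c: (pred[y == c] == c).mean() for c in np.unique(y)}
    best = max(acc, key=acc.get)       # class classified most accurately
    left = (y == best) & (pred == best)  # confident samples go left
    return ("split", (X[left], y[left]), (X[~left], y[~left]))
```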

Expert nodes of the NTEN

In the previous section, the training and testing procedures of NTEN were discussed, in which each node of NTEN is an expert node. The architecture of an expert node of NTEN is presented in Fig. 3. As one can see, such an expert node includes three subsystems.

  • The first subsystem consists of three components: a model-base, a knowledge-base and an inference engine. The inference engine receives a classification sub-problem, evaluates its data complexity and, with the aid of the knowledge-base, selects an eligible
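The excerpt does not name the exact complexity measure the inference engine evaluates. One standard candidate is Fisher's discriminant ratio F1 (Ho & Basu, 2002); using it here is an assumption of this sketch, shown for the two-class case.

```python
# Fisher's discriminant ratio F1 (Ho & Basu, 2002), two-class case:
# the maximum, over features, of between-class to within-class variance.
# Higher values mean the classes are easier to separate.
import numpy as np

def fisher_ratio(X, y):
    c0, c1 = np.unique(y)[:2]
    A, B = X[y == c0], X[y == c1]
    num = (A.mean(axis=0) - B.mean(axis=0)) ** 2
    den = A.var(axis=0) + B.var(axis=0) + 1e-12  # guard constant features
    return float((num / den).max())
```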

Theoretical analysis

Lemma 1

Finding K clusters of features by minimizing the objective function of Eq. (3) is a finite procedure.

Proof

Let each feature take at most θ distinct discrete values. Then the redundancy part of the objective function Fitness(q) can be evaluated in O(θ²). Also, based on Eq. (2), the VOR part can be evaluated in O(MC²log₂(N)), where M, C and N are the numbers of features, classes and samples, respectively. Thus, the objective function Fitness(q) can be evaluated in at most O(max{MC²log₂(N), θ²}) time, and thus the
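For reference, the evaluation cost stated in the proof can be written compactly, with the same symbols as above:

```latex
% Cost of one evaluation of Fitness(q): redundancy term plus VOR term,
% with M features, C classes, N samples, and at most \theta distinct
% values per feature.
\[
  T_{\mathrm{Fitness}}
    = O\!\left(\theta^{2}\right) + O\!\left(M C^{2}\log_{2} N\right)
    = O\!\left(\max\{\, M C^{2}\log_{2} N,\ \theta^{2} \,\}\right).
\]
```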

Experimental results

To show that NTEN achieves accurate results compared with existing neural trees, with shallower trees and without over-fitting or degradation of the classification accuracy, different experiments have been performed on 35 datasets, which are listed in Table 2.

Note that in these experiments, FDRK is another version of NTEN that applies feature clustering in the root node instead of in the expert nodes that face over-fitting (Abpeykar & Ghatee, 2018). Comparative results of COF-NT (Rani et al.,

Conclusions

Classification of high-dimensional data suffers from the curse of dimensionality and over-fitting. Feature selection is one solution to this problem. On the other hand, neural trees are powerful classifiers that, by combining local feature selection with recursive partitioning, can solve simple sub-problems at each node. Notwithstanding the achievements of existing neural trees, there is no neural tree with a multi-dimensional split driven by data complexity. Also, there is no neural tree with an expert system in the

References (73)

  • Fontenla-Romero, O., et al. (2010). A new convex objective function for the supervised learning of single-layer neural networks. Pattern Recognition.
  • Gorman, R. P., et al. (1988). Analysis of hidden units in a layered network trained to classify sonar targets. Neural Networks.
  • Güvenir, H. A., et al. (1998). Learning differential diagnosis of erythemato-squamous diseases using voting feature intervals. Artificial Intelligence in Medicine.
  • Hong, Z.-Q., et al. (1991). Optimal discriminant plane for a small number of samples and design method of classifier on the plane. Pattern Recognition.
  • Lipowski, A., et al. (2012). Roulette-wheel selection via stochastic acceptance. Physica A: Statistical Mechanics and its Applications.
  • Maji, P. (2008). Efficient design of neural network tree using a new splitting criterion. Neurocomputing.
  • Micheloni, C., et al. (2012). A balanced neural tree for pattern classification. Neural Networks.
  • Ojha, V. K., et al. (2017). Ensemble of heterogeneous flexible neural trees using multiobjective genetic programming. Applied Soft Computing.
  • Rani, A., et al. (2015). A neural tree for classification using convex objective function. Pattern Recognition Letters.
  • Rani, A., et al. (2013). Incorporating linear discriminant analysis in neural tree for multidimensional splitting. Applied Soft Computing.
  • Saettler, A., et al. (2017). Decision tree classification with bounded number of errors. Information Processing Letters.
  • Shi, Y., et al. (2018). Learning from label proportions on high-dimensional data. Neural Networks.
  • Silva Filho, T. M., et al. (2016). A swarm-trained k-nearest prototypes adaptive classifier with automatic feature selection for interval data. Neural Networks.
  • Yijing, L., et al. (2016). Adapted ensemble classification algorithm based on multiple classifier system and feature selection for classifying multi-class imbalanced data. Knowledge-Based Systems.
  • Ziyatdinov, A., et al. (2015). Bioinspired early detection through gas flow modulation in chemo-sensory systems. Sensors and Actuators B: Chemical.
  • Abpeykar, S., et al. (2018). An ensemble of RBF neural networks in decision tree structure with knowledge transferring to accelerate multi-classification. Neural Computing and Applications.
  • Aeberhard, S., et al. (1992). Comparison of classifiers in high dimensional settings. Tech. rep. 92(02).
  • Asuncion, A., et al. (2007). UCI machine learning repository.
  • Blower, P. E., et al. (2007). MicroRNA expression profiles for the NCI-60 cancer cell panel. Molecular Cancer Therapeutics.
  • Breiman, L. (2017). Classification and regression trees.
  • Brownlee, J. (2016). Master Machine Learning Algorithms: Discover how they work and implement them from scratch.
  • Cestnik, B. (1987). Assistant 86: A knowledge-elicitation tool for sophisticated users. Progress in Machine Learning.
  • Coello, C. C., et al. MOPSO: A proposal for multiple objective particle swarm optimization.
  • Cong, Y., et al. (2017). Online similarity learning for big data with overfitting. IEEE Transactions on Big Data.
  • Deb, K., et al. (2002). A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation.
  • Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research (JMLR).