A novel mixture of experts model based on cooperative coevolution
Introduction
Evolutionary artificial neural networks (EANNs) have been widely studied over the last few decades. The main power of artificial neural networks (ANNs) lies in their ability to correctly learn the underlying function or distribution of a data set from a sample. This ability is called generalization. Mathematically, generalization can be expressed as minimizing the recognition error of the neural network on previously unseen data. Thus, evolutionary computation (EC), a global optimization approach, can be employed to optimize this error function. As discussed in Yao's prominent review [31], evolutionary methods can be applied at different levels of an ANN, such as the architecture and the connection weights.
Much of the ANN literature concentrates on finding a single solution (network) to learn a task. However, a network that is optimal on the training data (i.e. seen data) may not generalize well to the testing data (i.e. unseen data). An ANN can either overtrain/overfit (memorizing the data rather than learning the underlying distribution) or undertrain (being trained too little, or being too simple, to fit the data well); see [7] on the bias/variance dilemma and the generalization problem. Many published works have shown that an ensemble of neural networks (i.e. a neuro-ensemble) can generalize better than individual networks [10], [16], [17], [18], [23], [25], [32], [33], [34]. The main argument for neuro-ensembles is that different members of the ensemble may possess different bias/variance trade-offs, so a suitable combination of these biases and variances can improve the generalization ability of the whole ensemble [25], [34]. Clearly, a set of near-identical members is undesirable: it multiplies the training effort without adding to the overall performance, since the system then performs no better than a single network.
One important application of neuro-ensembles is in problem decomposition. Most real-world problems are too complicated for a single individual to solve. Divide-and-conquer has proved efficient in many of these complex situations. The issues are (i) how to divide the problem into simpler subtasks, (ii) how to assign individuals to solve these subtasks, and (iii) how to combine the sub-solutions back into a whole system. If the problem has a distinct natural decomposition, it may be possible to derive such a decomposition by hand. However, in most real-world problems, we either know too little about the problem, or it is too complex for us to have a clear understanding of how to hand-decompose it into subproblems. Thus, it is desirable to have a method that automatically decomposes a complex problem into a set of overlapping or disjoint subproblems, and assigns one or more specialized problem-solving tools, or experts, to each of these subproblems. The remaining question is how to combine the outputs of these experts when the decomposition scheme is not known in advance.
Jacobs et al. [8], [9] proposed an ensemble method, the mixture of experts (ME), based on the divide-and-conquer principle. In their method, instead of assigning a fixed set of combination weights to the experts, an extra gating component computes these weights dynamically from the inputs (Fig. 1). The gating component is trained, together with the experts, through a specially tailored error function that localizes the experts on different subsets of the data while improving the system's performance. In the ME model, an expert can be of any type, e.g. an ANN or a C4.5 decision tree, but the gate is usually an ANN. Jordan and Jacobs [11], [12] extended the model to the hierarchical mixture of experts (HME), in which each expert of the ME model is itself replaced by an ME model. Since the ME model was proposed in 1991, it has attracted a wide range of research.
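For concreteness, the localized error function proposed in Jacobs et al.'s original paper [8] can be written as follows (the notation here is our own: $\mathbf{d}$ is the target, $\mathbf{o}_m$ the output of expert $m$, and $g_m$ the gate's weight for expert $m$):

$$E = -\ln \sum_{m=1}^{M} g_m \exp\!\left(-\tfrac{1}{2}\,\lVert \mathbf{d}-\mathbf{o}_m \rVert^{2}\right).$$

Minimizing $E$ shifts gate weight toward whichever expert currently fits the target best, which is what drives the localization described above.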
Some authors [1], [13], [14] have established how the ME model works in statistical terms. Waterhouse [28], [30] and Moerland [19] applied the Bayesian framework to design and explain the ME model. Under this interpretation, the ME outputs can be considered estimates of the posterior probabilities of class membership [19]. The Bayesian framework can thus be used to design the training error function [3] and to estimate the parameters of the ME model [30]. Beyond the original ME model, a large number of variants have been put forward. Waterhouse and Cook [29] and Avnimelech and Intrator [2] proposed combining ME with the boosting algorithm. They argued that, since boosting encourages classifiers to become experts on patterns that previous experts disagree on, it can split the data set into regions for the experts in the ME model and thus ensure localization of the experts, while the dynamic gating function of the ME ensures a good combination of classifiers [2]. Tang et al. [26] explicitly localized the experts by applying a self-organizing map to partition the input space among the experts. Wan and Bone [27] used a mixture of radial basis function networks to partition the input space into statistically correlated regions and to learn the local covariation model of the data in each region.
Gradient descent, thanks to its simple implementation and efficiency, is the most popular ANN training method, especially for industrial problems, but it has serious drawbacks, notably its susceptibility to local optima. The growing body of EC research as a global optimization method has led to a number of successful attempts to evolve ANNs [31]. A more recent branch of EC, the cooperative coevolutionary (CC) algorithm, was proposed by Potter and De Jong [21], [22]: the problem is decomposed into subcomponents, each evolved in its own subpopulation, and individuals are evaluated by how well they cooperate with representatives of the other subpopulations.
Garcia-Pedrajas et al. [6] applied multiobjective optimization in conjunction with CC to evolve subpopulations of well-performing, regularized, cooperative and diverse ANNs, which can then be used in a set of ensembles. Khare et al. [15] used the concept of CC on a set of subpopulations of radial basis function networks, where each subpopulation is designed to solve a particular subtask of the whole problem. A second level, consisting of a swarm of ensembles that combine selected ANNs from the subpopulations, is evolved in parallel with these subpopulations. The two disadvantages of this method are (i) its requirement for prior knowledge about the problem in order to fix the number of required modules and (ii) its dependence on credit assignment, in that the fitness of each module is determined by the module's contribution to the whole system. To remove the need for a fixed number of modules, Khare et al. [15] suggested using Potter's approach of adding and removing subpopulations whenever the system's fitness stagnates for a predetermined period. Despite the remaining credit assignment problem, their method has the merit that both the structures and the parameters, of both the modules and the whole ensemble, can be evolved within the framework.
Section snippets
Mixture of experts
The ME model consists of a number of experts combined through a gate (Fig. 1), all having access to the input space. The components can be any type of classifier; in this paper, we use simple feed-forward multilayer neural networks. The output of an ME model is the weighted average of the individual experts' outputs $\mathbf{o}_m(\mathbf{x})$, with the weights $g_m(\mathbf{x})$ produced by the gate network:

$$\mathbf{o}(\mathbf{x}) = \sum_{m=1}^{M} g_m(\mathbf{x})\,\mathbf{o}_m(\mathbf{x}), \qquad g_m(\mathbf{x}) = \frac{\exp(z_m(\mathbf{x}))}{\sum_{k=1}^{M}\exp(z_k(\mathbf{x}))},$$

where $z_m$ is the $m$th raw output of the gate network and the softmax normalization ensures the weights are positive and sum to one. The gate output $g_m(\mathbf{x})$ can be considered as the probability that expert $m$ is responsible for input $\mathbf{x}$.
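The combination rule is simple enough to sketch in code. The following minimal example (in Python with NumPy; the single-layer linear expert and gate parameterizations are our own simplifying assumptions, whereas the paper uses multilayer feed-forward networks) computes the ME output and the gate weights for a batch of inputs:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

class MixtureOfExperts:
    """Forward pass of an ME model with M single-layer experts and a
    single-layer gate; the combination rule is the one given above."""

    def __init__(self, n_inputs, n_outputs, n_experts):
        self.W_experts = rng.normal(0, 0.1, (n_experts, n_inputs, n_outputs))
        self.W_gate = rng.normal(0, 0.1, (n_inputs, n_experts))

    def forward(self, x):
        # Each expert sees the full input space, as in Fig. 1.
        expert_outs = np.stack([x @ W for W in self.W_experts])  # (M, N, n_outputs)
        # The gate maps the same input to one weight per expert.
        g = softmax(x @ self.W_gate)                             # (N, M)
        # ME output: weighted average of the experts' outputs.
        return np.einsum('nm,mnk->nk', g, expert_outs), g

x = rng.normal(size=(5, 4))          # 5 samples, 4 features
me = MixtureOfExperts(4, 3, n_experts=2)
y, g = me.forward(x)
print(y.shape, g.sum(axis=1))        # (5, 3) and all-ones: weights sum to 1
```

Because of the softmax, the gate weights for each sample form a convex combination, so the ME output always lies within the span of the experts' predictions.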
Cooperative coevolutionary mixture of experts (CCME)
In this paper, a novel method is introduced that combines the ME model with the CC mechanism. On the one hand, CC allows the system to incorporate both EC and backpropagation (BP), so that evolutionary search enhances the learning capabilities of BP, and the CC framework naturally supports problem decomposition, complementing the capabilities of ME. On the other hand, the ME model imposes an external diversity pressure, driving the components into different local regions of the input space and thus ensuring diversity between the subpopulations.
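This snippet gives the design rationale but not the algorithmic details, so the following runnable toy is only our reading of how a CC and ME hybrid fits together; the linear components, truncation selection, mutation-only variation, and the omission of the BP refinement step are all simplifying assumptions, not the paper's operators. What it does illustrate is the core cooperative-evaluation pattern: each candidate expert is scored inside a complete ME model assembled from the other subpopulations' current representatives.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def me_error(experts, gate, X, y):
    """Mean squared error of the assembled ME model (linear components
    stand in for the paper's multilayer networks)."""
    outs = np.stack([X @ W for W in experts])   # (M, N)
    g = softmax(X @ gate)                       # (N, M)
    pred = np.einsum('nm,mn->n', g, outs)
    return float(np.mean((pred - y) ** 2))

def evolve(pop, scores, sigma=0.05):
    """Truncation selection plus Gaussian mutation; a deliberately
    minimal stand-in for a real EC variation scheme."""
    order = np.argsort(scores)                  # lower error is better
    parents = [pop[i] for i in order[: len(pop) // 2]]
    children = [p + rng.normal(0, sigma, p.shape) for p in parents]
    return parents + children

# Toy regression task with two slopes: a natural two-region decomposition.
X = rng.uniform(-1, 1, (200, 1))
y = np.where(X[:, 0] > 0, 2.0 * X[:, 0], -0.5 * X[:, 0])

M, pop_size = 2, 10
expert_pops = [[rng.normal(0, 0.1, (1,)) for _ in range(pop_size)] for _ in range(M)]
gate_pop = [rng.normal(0, 0.1, (1, M)) for _ in range(pop_size)]
reps = [pop[0] for pop in expert_pops]          # current representatives
gate_rep = gate_pop[0]

for gen in range(50):
    # Cooperative evaluation: score each expert inside a full ME model
    # built from the other subpopulations' representatives.
    for i, pop in enumerate(expert_pops):
        scores = [me_error(reps[:i] + [ind] + reps[i+1:], gate_rep, X, y)
                  for ind in pop]
        expert_pops[i] = evolve(pop, np.array(scores))
        reps[i] = expert_pops[i][0]
    scores = [me_error(reps, gate, X, y) for gate in gate_pop]
    gate_pop = evolve(gate_pop, np.array(scores))
    gate_rep = gate_pop[0]

print("final ME error:", me_error(reps, gate_rep, X, y))
```

In a full CCME implementation one would additionally refine individuals with BP through the ME error function between generations, which is the EC/BP hybrid discussed above.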
Experiments
CCME is verified on 10 standard data sets taken from the UCI Machine Learning Repository, as summarized in Table 1. These data sets were downloaded from ics.uci.edu [4].
In the experiments, we used 10-fold cross-validation for each data set. The data set is divided into 10 disjoint subsets using stratified sampling. Each of the subsets is taken in turn as the test set, making 10 trials in all. In each experiment, one of the remaining subsets is chosen at random as the validation set, while the other eight form the training set.
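This protocol is straightforward to reproduce. A minimal sketch (using scikit-learn's StratifiedKFold, which is our tooling choice rather than anything specified in the paper) is:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(42)

def ten_fold_splits(X, y):
    """Yield (train, validation, test) index arrays following the protocol
    above: 10 stratified folds; each fold is the test set once, one of the
    remaining folds (chosen at random) is the validation set, and the
    other eight form the training set."""
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    folds = [test_idx for _, test_idx in skf.split(X, y)]
    for i, test_idx in enumerate(folds):
        rest = [j for j in range(10) if j != i]
        val_j = rng.choice(rest)
        val_idx = folds[val_j]
        train_idx = np.concatenate([folds[j] for j in rest if j != val_j])
        yield train_idx, val_idx, test_idx

# Example with a toy data set of 100 samples and binary labels.
X = rng.normal(size=(100, 5))
y = rng.integers(0, 2, size=100)
for train_idx, val_idx, test_idx in ten_fold_splits(X, y):
    # The three index sets are pairwise disjoint by construction.
    assert not set(test_idx) & set(val_idx)
```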
Conclusion and future work
In this paper, we have introduced a novel method based on the principles of both cooperative coevolution and mixture of experts. We have investigated different aspects of the proposed CCME model. The results of the experiments can be summarized as follows: (i) CCME is robust to varying ensemble complexity, in terms of the number of individual experts, and (ii) CCME is robust to varying ANN complexity, in terms of the number of hidden units. When comparing CCME and ME, the results show that CCME provides a better exploration of the weight space.
Acknowledgments
H. Abbass acknowledges the support of the ARC Centre for Complex Systems, grant number CEO0348249. The authors wish to thank the anonymous reviewers and the editor for their constructive comments.
References (34)
- et al., Ensemble learning via negative correlation, Neural Networks (1999)
- et al., Stopping criteria for ensemble of evolutionary artificial neural networks
- et al., Bounds for the average generalization error of the mixture of experts neural network
- et al., Boosted mixture of experts: an ensemble learning scheme, Neural Comput. (1999)
- Neural Networks for Pattern Recognition (1995)
- C. Blake, C. Merz, UCI repository of machine learning databases, 〈http://www.ics.uci.edu/mlearn/mlrepository.html〉, ...
- et al., Evolution, neural networks, games, and intelligence, Proc. IEEE (1999)
- et al., Cooperative coevolution of artificial neural network ensembles for pattern recognition, IEEE Trans. Evol. Comput. (2005)
- Neural Networks: A Comprehensive Foundation (1999)
- R. Jacobs, M. Jordan, A. Barto, Task decomposition through competition in a modular connectionist architecture: the ...
- Adaptive mixtures of local experts, Neural Comput.
- Dynamically weighted ensemble neural networks for classification
- Hierarchies of adaptive experts
- Hierarchical mixtures of experts and the EM algorithm, Neural Comput.
- Statistical mechanics of the mixture of experts
- Co-evolutionary modular neural networks for automatic problem decomposition
Cited by (31)
- Nature-inspired optimal tuning of input membership functions of Takagi-Sugeno-Kang fuzzy models for Anti-lock Braking Systems (2015, Applied Soft Computing). Citation excerpt: "A variable structure systems approach is combined in Ref. [28] with the Levenberg–Marquardt algorithm to produce an optimization approach applied to the training of fuzzy inference systems and tested on a two degrees of freedom direct drive SCARA robotic manipulator. A cooperative coevolutionary algorithm is combined with a basic mixture of experts model in Ref. [29] and applied to classification problems showing a better exploration of the weight space. A framework for granular computing neural-fuzzy modeling structures based on neutrosophic logic is suggested in Ref. [30] and applied on real-world industrial data."
- Mixture of feature specified experts (2014, Information Fusion). Citation excerpt: "In some face processing studies [39,40] the input space is divided among the experts based on the pose of face or facial expression. In [44], a method based on both cooperative coevolution and Mixture of Experts is introduced to decompose problems into different regions and assign the experts to these distinct regions. By considering all mentioned methods, it is observed that the boundaries between the subspaces are distinctive in all of these methods."
- Embedded local feature selection within mixture of experts (2014, Information Sciences). Citation excerpt: "This idea adds flexibility to model the local covariance of the data. Nguyen et al. [32] propose a variation to the classical MoE by using an evolutionary algorithm to learn the model. The overall model is an ensemble, where each component is a mixture of experts."
- A multi-agent system for web-based risk management in small and medium business (2012, Expert Systems with Applications). Citation excerpt: "Nowadays mixture of experts is a technique used in several fields (Garvey & Lesser, 1993; Subasi, 2007). A mixture of experts provide advances capacities by fusing the outputs of various processes (experts) and obtain the response more suitable for the final value (Lima, Coelho, & Von Zuben, 2007; Nguyena, Abbassa, & Mckay, 2006). Mixtures of experts are also commonly used for classification and are usually called ensemble (Zhanga & Lu, 2010)."
- Fast training of neural trees by adaptive splitting based on cubature (2008, Neurocomputing)
- A Survey on Cooperative Co-Evolutionary Algorithms (2019, IEEE Transactions on Evolutionary Computation)
Minh Ha Nguyen graduated with first-class honors from the University of Canberra, Australia, in 1999. She is a PhD candidate at the University of New South Wales, Australia (2002–2006). Her work focuses on EC, neural networks and data mining.
Hussein A. Abbass is the director of the Artificial Life and Adaptive Robotics Laboratory at the School of Information Technology and Electrical Engineering, University of New South Wales at the Australian Defence Force Academy in Canberra, Australia. He is a senior member of the IEEE, chair of the IEEE working group on Artificial Life and Complex Adaptive Systems, technical cochair of SEAL 2006, cochair of the First IEEE Symposium on Artificial Life, Honolulu 2007, and technical cochair of IEEE-CEC 2007, and is or has been cochair of, and on the program committee of, many conferences in the field. His work focuses on EC, neural networks and complex systems.
Robert I. McKay graduated from the Australian National University in 1971 and was awarded his PhD in the theory of computation by the University of Bristol in 1976. He researched computer typesetting at the Commonwealth Scientific and Industrial Research Organization from 1976 to 1985, when he joined the University of New South Wales (Australian Defence Force Academy campus). In 2005 he moved to Seoul National University, where he heads the Structural Complexity Laboratory (http://scsnu.ac.kr).