A novel mixture of experts model based on cooperative coevolution
Introduction
Evolutionary artificial neural networks (EANNs) have been widely studied over the last few decades. The main power of artificial neural networks (ANNs) lies in their ability to correctly learn the underlying function or distribution of a data set from a sample. This ability is called generalization. Mathematically, generalization can be expressed as minimizing the recognition error of the neural network on previously unseen data. Thus, evolutionary computation (EC), a global optimization approach, can be employed to optimize this error function. As discussed in Yao's prominent review [31], evolutionary methods can be applied at different levels of an ANN, such as the architecture and the connection weights.
Much of the ANN literature concentrates on finding a single solution (network) to learn a task. However, a network that is optimal on the training data (i.e. seen data) may not generalize well to the testing data (i.e. unseen data). An ANN can either overtrain/overfit (memorizing the data rather than learning the underlying distribution) or undertrain (being trained too little, or being too simple, to fit the data well); see [7] on the bias/variance dilemma and the generalization problem. Many published works have shown that an ensemble of neural networks (i.e. a neuro-ensemble) can generalize better than individual networks [10], [16], [17], [18], [23], [25], [32], [33], [34]. The main argument for neuro-ensembles is that different members of the ensemble may possess different bias/variance trade-offs, so a suitable combination of these biases and variances can improve the generalization ability of the whole ensemble [25], [34]. Clearly, a set of near-identical members is undesirable: it multiplies the training effort without adding to the overall performance, since the system then performs no better than a single network.
One important application of neuro-ensembles is in problem decomposition. Most real-world problems are too complicated for a single individual to solve. Divide-and-conquer has proved efficient in many of these complex situations. The issues are (i) how to divide the problem into simpler subtasks, (ii) how to assign individuals to solve these subtasks, and (iii) how to combine the sub-solutions back into a whole system. If the problem has a distinct natural decomposition, it may be possible to derive such a decomposition by hand. However, in most real-world problems, we either know too little about the problem, or it is too complex for us to have a clear understanding of how to hand-decompose it into subproblems. Thus, it is desirable to have a method that automatically decomposes a complex problem into a set of overlapping or disjoint subproblems, and assigns one or more specialized problem-solving tools, or experts, to each of these subproblems. The remaining question is how to combine the outputs of these experts when the decomposition scheme is not known in advance.
Jacobs et al. [8], [9] proposed an ensemble method, the mixture of experts (ME), based on the divide-and-conquer principle. In their method, instead of assigning a fixed set of combination weights to the experts, an extra gating component computes these weights dynamically from the inputs (Fig. 1). The gating component is trained, together with the experts, through a specially tailored error function that localizes the experts on different subsets of the data while improving the system's performance. In the ME model, an expert can be of any type, e.g. an ANN or a C4.5 decision tree, but the gate is usually an ANN. Jordan and Jacobs [11], [12] extended the model to the hierarchical mixture of experts (HME), in which each expert of the ME model is itself replaced by an ME model. Since the ME model was proposed in 1991, it has attracted a wide range of research.
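For concreteness, the localized error function proposed in Jacobs et al.'s original paper [8] can be written as follows (the notation here is our own: $\mathbf{d}$ is the target, $\mathbf{o}_m$ the output of expert $m$, and $g_m$ the gate's weight for expert $m$):

$$E = -\ln \sum_{m=1}^{M} g_m \exp\!\left(-\tfrac{1}{2}\,\lVert \mathbf{d}-\mathbf{o}_m \rVert^{2}\right).$$

Minimizing $E$ shifts gate weight toward whichever expert currently fits the target best, which is what drives the localization described above.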
Some authors [1], [13], [14] have established how the ME model works in statistical terms. Waterhouse [28], [30] and Moerland [19] applied the Bayesian framework to design and explain the ME model. Under this interpretation, the ME outputs can be considered estimates of the posterior probabilities of class membership [19]. The Bayesian framework can thus be used to design the training error function [3] and to estimate the parameters of the ME model [30]. Beyond the original ME model, a large number of variants have been put forward. Waterhouse and Cook [29] and Avnimelech and Intrator [2] proposed combining ME with the boosting algorithm. They argued that, since boosting encourages classifiers to become experts on patterns that previous experts disagree on, it can split the data set into regions for the experts in the ME model and thus ensure localization of the experts, while the dynamic gating function of the ME ensures a good combination of classifiers [2]. Tang et al. [26] explicitly localized the experts by applying a self-organizing map to partition the input space among the experts. Wan and Bone [27] used a mixture of radial basis function networks to partition the input space into statistically correlated regions and to learn the local covariation model of the data in each region.
Gradient descent, thanks to its simple implementation and efficiency, is the most popular ANN training method, especially for industrial problems, but it has serious drawbacks, notably its susceptibility to local optima. The growing body of EC research as a global optimization method has led to a number of successful attempts to evolve ANNs [31]. A more recent branch of EC, the cooperative coevolutionary (CC) algorithm, was proposed by Potter and De Jong [21], [22]: the problem is decomposed into subcomponents, each evolved in its own subpopulation, and individuals are evaluated by how well they cooperate with representatives of the other subpopulations.
Garcia-Pedrajas et al. [6] applied multiobjective optimization in conjunction with CC to evolve subpopulations of well-performing, regularized, cooperative and diverse ANNs, which can then be used in a set of ensembles. Khare et al. [15] used the concept of CC on a set of subpopulations of radial basis function networks, where each subpopulation is designed to solve a particular subtask of the whole problem. A second level, consisting of a swarm of ensembles that combine selected ANNs from the subpopulations, is evolved in parallel with these subpopulations. The two disadvantages of this method are (i) its requirement for prior knowledge about the problem in order to fix the number of required modules and (ii) its dependence on credit assignment, in that the fitness of each module is determined by the module's contribution to the whole system. To remove the need for a fixed number of modules, Khare et al. [15] suggested using Potter's approach of adding and removing subpopulations whenever the system's fitness stagnates for a predetermined period. Despite the remaining credit assignment problem, their method has the merit that both the structures and the parameters, of both the modules and the whole ensemble, can be evolved within the framework.
Section snippets
Mixture of experts
The ME model consists of a number of experts combined through a gate (Fig. 1), all having access to the input space. The components can be any type of classifier; in this paper, we use simple feed-forward multilayer neural networks. The output of an ME model is the weighted average of the individual experts' outputs $\mathbf{o}_m(\mathbf{x})$, with the weights $g_m(\mathbf{x})$ produced by the gate network:

$$\mathbf{o}(\mathbf{x}) = \sum_{m=1}^{M} g_m(\mathbf{x})\,\mathbf{o}_m(\mathbf{x}), \qquad g_m(\mathbf{x}) = \frac{\exp(z_m(\mathbf{x}))}{\sum_{k=1}^{M}\exp(z_k(\mathbf{x}))},$$

where $z_m$ is the $m$th raw output of the gate network and the softmax normalization ensures the weights are positive and sum to one. The gate output $g_m(\mathbf{x})$ can be considered as the probability that expert $m$ is responsible for input $\mathbf{x}$.
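The combination rule is simple enough to sketch in code. The following minimal example (in Python with NumPy; the single-layer linear expert and gate parameterizations are our own simplifying assumptions, whereas the paper uses multilayer feed-forward networks) computes the ME output and the gate weights for a batch of inputs:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

class MixtureOfExperts:
    """Forward pass of an ME model with M single-layer experts and a
    single-layer gate; the combination rule is the one given above."""

    def __init__(self, n_inputs, n_outputs, n_experts):
        self.W_experts = rng.normal(0, 0.1, (n_experts, n_inputs, n_outputs))
        self.W_gate = rng.normal(0, 0.1, (n_inputs, n_experts))

    def forward(self, x):
        # Each expert sees the full input space, as in Fig. 1.
        expert_outs = np.stack([x @ W for W in self.W_experts])  # (M, N, n_outputs)
        # The gate maps the same input to one weight per expert.
        g = softmax(x @ self.W_gate)                             # (N, M)
        # ME output: weighted average of the experts' outputs.
        return np.einsum('nm,mnk->nk', g, expert_outs), g

x = rng.normal(size=(5, 4))          # 5 samples, 4 features
me = MixtureOfExperts(4, 3, n_experts=2)
y, g = me.forward(x)
print(y.shape, g.sum(axis=1))        # (5, 3) and all-ones: weights sum to 1
```

Because of the softmax, the gate weights for each sample form a convex combination, so the ME output always lies within the span of the experts' predictions.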
Cooperative coevolutionary mixture of experts (CCME)
In this paper, a novel method is introduced that combines the ME model with the CC mechanism. On the one hand, CC allows the system to incorporate both EC and backpropagation (BP), so that evolutionary search enhances the learning capabilities of BP, and the CC framework naturally supports problem decomposition, complementing the capabilities of ME. On the other hand, the ME model imposes an external diversity pressure, driving the components into different local regions of the input space and thus ensuring diversity between the subpopulations.
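This snippet gives the design rationale but not the algorithmic details, so the following runnable toy is only our reading of how a CC and ME hybrid fits together; the linear components, truncation selection, mutation-only variation, and the omission of the BP refinement step are all simplifying assumptions, not the paper's operators. What it does illustrate is the core cooperative-evaluation pattern: each candidate expert is scored inside a complete ME model assembled from the other subpopulations' current representatives.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def me_error(experts, gate, X, y):
    """Mean squared error of the assembled ME model (linear components
    stand in for the paper's multilayer networks)."""
    outs = np.stack([X @ W for W in experts])   # (M, N)
    g = softmax(X @ gate)                       # (N, M)
    pred = np.einsum('nm,mn->n', g, outs)
    return float(np.mean((pred - y) ** 2))

def evolve(pop, scores, sigma=0.05):
    """Truncation selection plus Gaussian mutation; a deliberately
    minimal stand-in for a real EC variation scheme."""
    order = np.argsort(scores)                  # lower error is better
    parents = [pop[i] for i in order[: len(pop) // 2]]
    children = [p + rng.normal(0, sigma, p.shape) for p in parents]
    return parents + children

# Toy regression task with two slopes: a natural two-region decomposition.
X = rng.uniform(-1, 1, (200, 1))
y = np.where(X[:, 0] > 0, 2.0 * X[:, 0], -0.5 * X[:, 0])

M, pop_size = 2, 10
expert_pops = [[rng.normal(0, 0.1, (1,)) for _ in range(pop_size)] for _ in range(M)]
gate_pop = [rng.normal(0, 0.1, (1, M)) for _ in range(pop_size)]
reps = [pop[0] for pop in expert_pops]          # current representatives
gate_rep = gate_pop[0]

for gen in range(50):
    # Cooperative evaluation: score each expert inside a full ME model
    # built from the other subpopulations' representatives.
    for i, pop in enumerate(expert_pops):
        scores = [me_error(reps[:i] + [ind] + reps[i+1:], gate_rep, X, y)
                  for ind in pop]
        expert_pops[i] = evolve(pop, np.array(scores))
        reps[i] = expert_pops[i][0]
    scores = [me_error(reps, gate, X, y) for gate in gate_pop]
    gate_pop = evolve(gate_pop, np.array(scores))
    gate_rep = gate_pop[0]

print("final ME error:", me_error(reps, gate_rep, X, y))
```

In a full CCME implementation one would additionally refine individuals with BP through the ME error function between generations, which is the EC/BP hybrid discussed above.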
Experiments
CCME is verified on 10 standard data sets taken from the UCI Machine Learning Repository, as summarized in Table 1. These data sets were downloaded from ics.uci.edu [4].
In the experiments, we used 10-fold cross-validation for each data set. The data set is divided into 10 disjoint subsets using stratified sampling. Each of the subsets is taken in turn as the test set, making 10 trials in all. In each experiment, one of the remaining subsets is chosen at random as the validation set, while the other eight form the training set.
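This protocol is straightforward to reproduce. A minimal sketch (using scikit-learn's StratifiedKFold, which is our tooling choice rather than anything specified in the paper) is:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(42)

def ten_fold_splits(X, y):
    """Yield (train, validation, test) index arrays following the protocol
    above: 10 stratified folds; each fold is the test set once, one of the
    remaining folds (chosen at random) is the validation set, and the
    other eight form the training set."""
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    folds = [test_idx for _, test_idx in skf.split(X, y)]
    for i, test_idx in enumerate(folds):
        rest = [j for j in range(10) if j != i]
        val_j = rng.choice(rest)
        val_idx = folds[val_j]
        train_idx = np.concatenate([folds[j] for j in rest if j != val_j])
        yield train_idx, val_idx, test_idx

# Example with a toy data set of 100 samples and binary labels.
X = rng.normal(size=(100, 5))
y = rng.integers(0, 2, size=100)
for train_idx, val_idx, test_idx in ten_fold_splits(X, y):
    # The three index sets are pairwise disjoint by construction.
    assert not set(test_idx) & set(val_idx)
```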
Conclusion and future work
In this paper, we have introduced a novel method based on the principles of both cooperative coevolution and mixture of experts. We have investigated different aspects of the proposed CCME model. The results of the experiments can be summarized as follows: (i) CCME is robust to varying ensemble complexity, in terms of the number of individual experts, and (ii) CCME is robust to varying ANN complexity, in terms of the number of hidden units. When comparing CCME and ME, the results show that CCME provides a better exploration of the weight space.
Acknowledgments
H. Abbass acknowledges the support of the ARC Centre for Complex Systems, grant number CEO0348249. The authors wish to thank the anonymous reviewers and the editor for their constructive comments.
References (34)
- et al., Ensemble learning via negative correlation, Neural Networks (1999)
- et al., Stopping criteria for ensemble of evolutionary artificial neural networks
- et al., Bounds for the average generalization error of the mixture of experts neural network
- et al., Boosted mixture of experts: an ensemble learning scheme, Neural Comput. (1999)
- Neural Networks for Pattern Recognition (1995)
- C. Blake, C. Merz, UCI repository of machine learning databases, 〈http://www.ics.uci.edu/mlearn/mlrepository.html〉, ...
- et al., Evolution, neural networks, games, and intelligence, Proc. IEEE (1999)
- et al., Cooperative coevolution of artificial neural network ensembles for pattern recognition, IEEE Trans. Evol. Comput. (2005)
- Neural Networks: A Comprehensive Foundation (1999)
- R. Jacobs, M. Jordan, A. Barto, Task decomposition through competition in a modular connectionist architecture: the ...
- Adaptive mixtures of local experts, Neural Comput.
- Dynamically weighted ensemble neural networks for classification
- Hierarchies of adaptive experts
- Hierarchical mixtures of experts and the EM algorithm, Neural Comput.
- Statistical mechanics of the mixture of experts
- Co-evolutionary modular neural networks for automatic problem decomposition
Cited by (31)
- Nature-inspired optimal tuning of input membership functions of Takagi-Sugeno-Kang fuzzy models for Anti-lock Braking Systems (2015, Applied Soft Computing). Citation excerpt: "A variable structure systems approach is combined in Ref. [28] with the Levenberg–Marquardt algorithm to produce an optimization approach applied to the training of fuzzy inference systems and tested on a two degrees of freedom direct drive SCARA robotic manipulator. A cooperative coevolutionary algorithm is combined with a basic mixture of experts model in Ref. [29] and applied to classification problems showing a better exploration of the weight space. A framework for granular computing neural-fuzzy modeling structures based on neutrosophic logic is suggested in Ref. [30] and applied on real-world industrial data."
- Mixture of feature specified experts (2014, Information Fusion). Citation excerpt: "In some face processing studies [39,40] the input space is divided among the experts based on the pose of face or facial expression. In [44], a method based on both cooperative coevolution and Mixture of Experts is introduced to decompose problems into different regions and assign the experts to these distinct regions. By considering all mentioned methods, it is observed that the boundaries between the subspaces are distinctive in all of these methods."
- Embedded local feature selection within mixture of experts (2014, Information Sciences). Citation excerpt: "This idea adds flexibility to model the local covariance of the data. Nguyen et al. [32] propose a variation to the classical MoE by using an evolutionary algorithm to learn the model. The overall model is an ensemble, where each component is a mixture of experts."
- A multi-agent system for web-based risk management in small and medium business (2012, Expert Systems with Applications). Citation excerpt: "Nowadays mixture of experts is a technique used in several fields (Garvey & Lesser, 1993; Subasi, 2007). A mixture of experts provide advances capacities by fusing the outputs of various processes (experts) and obtain the response more suitable for the final value (Lima, Coelho, & Von Zuben, 2007; Nguyena, Abbassa, & Mckay, 2006). Mixtures of experts are also commonly used for classification and are usually called ensemble (Zhanga & Lu, 2010)."
- Fast training of neural trees by adaptive splitting based on cubature (2008, Neurocomputing)
- A Survey on Cooperative Co-Evolutionary Algorithms (2019, IEEE Transactions on Evolutionary Computation)
Minh Ha Nguyen graduated with first-class honors from the University of Canberra, Australia, in 1999. She is a PhD candidate at the University of New South Wales, Australia (2002–2006). Her work focuses on EC, neural networks and data mining.
Hussein A. Abbass is the director of the Artificial Life and Adaptive Robotics Laboratory at the School of Information Technology and Electrical Engineering, University of New South Wales at the Australian Defence Force Academy in Canberra, Australia. He is a senior member of the IEEE, chair of the IEEE working group on Artificial Life and Complex Adaptive Systems, technical cochair of SEAL 2006, cochair of the First IEEE Symposium on Artificial Life, Honolulu 2007, and technical cochair of IEEE-CEC 2007, and is or has been cochair of, and on the program committee of, many conferences in the field. His work focuses on EC, neural networks and complex systems.
Robert I. McKay graduated from the Australian National University in 1971 and was awarded his PhD in the theory of computation by the University of Bristol in 1976. He researched computer typesetting at the Commonwealth Scientific and Industrial Research Organization from 1976 to 1985, when he joined the University of New South Wales (Australian Defence Force Academy campus). In 2005 he moved to Seoul National University, where he heads the Structural Complexity Laboratory (http://scsnu.ac.kr).