Distributed learning for Random Vector Functional-Link networks
Introduction
Over the past decades, supervised learning techniques have been extensively developed, with both theoretical analyses and empirical studies [15]. The ICT world, however, is being rapidly reshaped by emerging trends such as big data [22], pervasive computing [33], commodity computing [11], the Internet of Things [2], and several others. All these frameworks share a common underlying theme: computing power is now a widespread feature surrounding us, and the same can be said about data. Consequently, supervised learning is expected to face major technological and theoretical challenges, since in many situations the overall training data cannot be assumed to lie at a single location, nor is it realistic to have a centralized authority for collecting and processing it. The previous trends also put forth the challenge of analyzing structured and heterogeneous data [4]; however, we are not concerned with this issue in this paper.
As a prototypical example, consider solving a music classification task (e.g., genre classification [35]) over a peer-to-peer (P2P) network of computers, each node possessing its own labeled database of songs. It is reasonable to assume that, to obtain good performance, no single database may be sufficient, and there is the need to leverage the data of all users. However, in a P2P network no centralized authority exists, hence the nodes need a distributed training protocol to solve the classification task. In fact, it is known that a fully decentralized training algorithm can be useful even in situations where having a master node is technologically feasible [16]. In particular, such a distributed algorithm removes the risk of having a single point of failure, or a communication bottleneck towards the central node. Similar situations are also widespread in Wireless Sensor Networks (WSNs), where additional power concerns arise [5]. Finally, it may happen that data simply cannot be moved across the network: either because it is too large (in terms of number of examples or dimensionality of each pattern), or because fundamental privacy concerns are present [40]. This general setting, which we will call ‘data-distributed learning’, is graphically depicted in Fig. 1.
So far, a large body of research has gone into developing fully distributed, decentralized learning algorithms, including works on diffusion adaptation [21], [34], learning by consensus [16], distributed learning on commodity cluster architectures [8], adaptation on WSNs [5], [32], distributed online learning [13], distributed optimization [6], [14], [18], [38], ad-hoc learning algorithms for specific architectures [12], [26], distributed databases [20], and others. Despite this, many important research questions remain open [31]; in particular, several well-known learning models, originally formulated in the centralized setting, have not yet been generalized to the fully decentralized setting.
In this paper, we propose two distributed learning algorithms for a yet-unexplored model in this setting, namely Random Vector Functional-Link (RVFL) networks [1], [10], [30], [37]. As illustrated in the following, RVFLs can be viewed as feedforward neural networks with a single hidden layer, resulting in a linear combination of a (fixed) number of non-linear expansions of the original input. A remarkable characteristic of this learner model lies in the way its parameters are assigned: the input weights and biases are randomly chosen and fixed in advance, before training. Despite this simplification, RVFLs can be shown to possess universal approximation capabilities, provided a sufficiently large set of expansions [17]. This grants them a number of peculiar characteristics, making them particularly suited to a distributed environment. In particular, RVFL models are linear in the parameters, so optimal parameters can be found with a standard linear regression routine, which can be implemented efficiently even on low-cost hardware, such as sensors or mobile devices [30]. In fact, the optimum of the training problem can be formulated in closed form, involving only matrix inversions and multiplications, making the model efficient even when confronted with large amounts of data. Finally, the same formulation can be used equivalently in the classification and in the regression setting. In this paper, we focus on the development of batch learning schemes; however, the proposed algorithms can be further extended to sequential learning with the use of standard gradient-descent procedures [10], whose decentralized formulations have been only partially investigated in the literature [34].
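To make the closed-form training concrete, the following is a minimal sketch of an RVFL network in NumPy. The uniform range for the random weights, the tanh activation, and all function names are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

def rvfl_train(X, y, n_hidden=100, lam=1e-2, rng=None):
    """Train an RVFL network: random fixed hidden layer, ridge-regression output weights."""
    rng = np.random.default_rng(rng)
    d = X.shape[1]
    # Input weights and biases are drawn at random and never trained.
    W = rng.uniform(-1.0, 1.0, size=(d, n_hidden))
    b = rng.uniform(-1.0, 1.0, size=n_hidden)
    H = np.tanh(X @ W + b)                      # non-linear expansions of the input
    # Closed-form ridge solution: beta = (H^T H + lam I)^{-1} H^T y
    beta = np.linalg.solve(H.T @ H + lam * np.eye(n_hidden), H.T @ y)
    return W, b, beta

def rvfl_predict(X, W, b, beta):
    """Output is a linear combination of the fixed random expansions."""
    return np.tanh(X @ W + b) @ beta
```

Note that only `beta` is learned; this is what makes the model linear in the parameters and the optimum expressible with matrix products and a single inversion.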
The key idea behind the proposed algorithms is to let all nodes train a local model (simultaneously) using their own subset of the training data, and then to find the common output weights of the master learner model. Two effective approaches for defining the common output weights are adopted in this study. One is the Decentralized Average Consensus (DAC) strategy [28]; the other is the well-known Alternating Direction Method of Multipliers (ADMM) algorithm [6]. DAC is an efficient protocol for computing averages over very general networks, with two main characteristics. Firstly, it does not require a centralized authority coordinating the overall process; secondly, it can be easily implemented even on the simplest networks [16]. These characteristics have made DAC an attractive method in many distributed learning algorithms, particularly in the ‘learning by consensus’ theory outlined in [16]. From a theoretical viewpoint, the DAC-based algorithm is similar to a bagged ensemble of linear predictors [7], and despite its simplicity and non-optimal nature, our experimental simulations show that it results in highly competitive performance. The second strategy (ADMM) is the most widely employed distributed optimization algorithm in machine learning (e.g., for LASSO [6] and Support Vector Machines [14]), making it a natural candidate for the current research. This second strategy is more computationally demanding than the DAC-based one, but it comes with strong theoretical guarantees in terms of convergence speed and accuracy. Our simulation results obtained from both algorithms are quite promising and comparable to those of a centralized model exploiting the overall dataset. Moreover, the consensus strategy is extremely competitive on a large number of realistic network topologies.
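As a rough illustration of the DAC building block, the sketch below iteratively averages local vectors (e.g., locally trained output weights) over a network without any central coordinator. The Metropolis–Hastings mixing weights are one common choice and an assumption here, not necessarily the rule adopted in the paper.

```python
import numpy as np

def dac_average(local_vectors, A, n_iter=200):
    """Decentralized average consensus: each node repeatedly mixes its value
    with its neighbours' until all nodes agree on the global average.
    A is the (symmetric, zero-diagonal) adjacency matrix of the network."""
    L = len(local_vectors)
    deg = A.sum(axis=1)
    # Metropolis-Hastings mixing weights: doubly stochastic for any connected graph.
    Wmix = np.zeros((L, L))
    for i in range(L):
        for j in range(L):
            if A[i, j]:
                Wmix[i, j] = 1.0 / (1 + max(deg[i], deg[j]))
    np.fill_diagonal(Wmix, 1.0 - Wmix.sum(axis=1))
    V = np.stack(local_vectors)          # one row per node
    for _ in range(n_iter):
        V = Wmix @ V                     # each node averages with its neighbours only
    return V                             # every row converges to the global mean
```

Each iteration uses only neighbour-to-neighbour communication, which is why the protocol runs on very general topologies without a master node.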
The remainder of the paper is organized as follows. Section 2 briefly reviews RVFLs and their learning algorithm, and introduces the DAC algorithm. Section 3 describes the data-distributed learning framework, and proposes two training algorithms for RVFL models. Sections 4 and 5 detail the experimental setup and the numerical results on four realistic datasets, plus an additional experiment on a large-scale image classification task. Section 6 concludes the paper with some discussion and possible future research directions.
Preliminaries
This section provides some supporting results that will be used in the subsequent sections. We start by formulating some basic concepts related to RVFL networks, with a least-squares solution as their learning algorithm (Section 2.1). Then, we briefly introduce the DAC algorithm for evaluating global averages under a decentralized information structure (Section 2.2).
Problem formulation
In the distributed learning setting, we consider a network of L nodes as detailed in Section 2.2, and we suppose that the kth node, k = 1, …, L, has access to its own training set. Note that we identify each example with a double subscript (i, k), meaning the ith example of the kth node. Moreover, we assume that node k has N_k examples available for training. In this case, extending Eq. (4), the global optimization problem can be stated as the minimization, with respect to the common output weights, of the sum over the L nodes of the local least-squares costs, plus the regularization term.
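The global problem above can be tackled with consensus ADMM: each node alternates a local ridge-regression step with an averaging step that enforces agreement on a shared variable. The sketch below is a hedged illustration under assumed notation, where node k holds its local hidden matrix H_k and targets y_k; the penalty parameter rho and all names are illustrative, not the paper's exact formulation.

```python
import numpy as np

def admm_rvfl(H_list, y_list, lam=1e-2, rho=1.0, n_iter=300):
    """Consensus ADMM sketch for the global regularized least-squares problem:
    minimize sum_k 0.5*||H_k b_k - y_k||^2 + (lam/2)*||z||^2  s.t.  b_k = z."""
    L = len(H_list)
    B = H_list[0].shape[1]
    z = np.zeros(B)                       # shared (consensus) output weights
    u = [np.zeros(B) for _ in range(L)]   # scaled dual variables, one per node
    # Each node's local system matrix H_k^T H_k + rho*I stays fixed across iterations.
    lhs = [h.T @ h + rho * np.eye(B) for h in H_list]
    rhs0 = [h.T @ y for h, y in zip(H_list, y_list)]
    for _ in range(n_iter):
        # Local step: each node solves its own small ridge subproblem.
        beta = [np.linalg.solve(lhs[k], rhs0[k] + rho * (z - u[k])) for k in range(L)]
        # Global step: average with shrinkage coming from the L2 regularizer.
        z = rho * sum(b + uk for b, uk in zip(beta, u)) / (lam + L * rho)
        # Dual step: accumulate the disagreement of each node with the consensus.
        for k in range(L):
            u[k] += beta[k] - z
    return z
```

In a fully decentralized deployment the global averaging step would itself be replaced by a consensus round, so that no master node is ever required.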
Description of the datasets
We tested our algorithms on four publicly available datasets, whose characteristics are summarized in Table 1. We have chosen them to represent different application domains for our algorithms, and to provide enough diversity in terms of size, number of features, and class imbalance:
- •
Garageband is a music classification problem [25], where the task is to discern among 9 different genres. As we stated in the introductory section, in the distributed case we can assume that the songs are
Accuracy and training times
The first set of experiments shows that both of the proposed algorithms are able to approximate the centralized solution very closely, irrespective of the number of nodes in the network. The topology of the network in these experiments is constructed according to the so-called ‘Erdős–Rényi model’ [27]: once we have selected a number L of nodes, we randomly construct an adjacency matrix such that every edge has a probability p of appearing, with p specified a priori. For the moment,
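The Erdős–Rényi construction described above can be sketched as follows (function names are assumptions), together with a connectivity check, since a disconnected graph would prevent the nodes from ever reaching a network-wide consensus.

```python
import numpy as np

def erdos_renyi(L, p, rng=None):
    """Sample a symmetric, zero-diagonal adjacency matrix where each of the
    L*(L-1)/2 possible edges appears independently with probability p."""
    rng = np.random.default_rng(rng)
    upper = np.triu(rng.random((L, L)) < p, k=1)   # sample the upper triangle only
    return (upper | upper.T).astype(int)           # mirror it: undirected graph

def is_connected(A):
    """Breadth-first search from node 0: connected iff every node is reached."""
    L = A.shape[0]
    seen = {0}
    frontier = [0]
    while frontier:
        i = frontier.pop()
        for j in np.flatnonzero(A[i]):
            if j not in seen:
                seen.add(j)
                frontier.append(j)
    return len(seen) == L
```

In experimental practice one would typically resample until `is_connected` returns true, so that every generated topology admits consensus.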
Conclusions
Distributed learning has received considerable attention over the past years due to its broad real-world applications. It is common nowadays that data must be collected and stored locally, while data exchange is not allowed for various reasons. In such circumstances, it is necessary and useful to build a master learner model effectively and efficiently. In this paper, we have presented two distributed learning algorithms for training RVFL networks over interconnected nodes. These algorithms allow
Acknowledgment
The authors wish to thank Roberto Fierimonte, M.Sc., for his helpful comments and discussions.
References
- Fast decorrelated neural network ensembles with random weights, Inf. Sci. (2014)
- The internet of things: a survey, Comput. Networks (2010)
- Distributed machine learning in networks by consensus, Neurocomputing (2014)
- Divergence-based feature selection for separate classes, Neurocomputing (2013)
- Optimal distributed algorithms for minimum weight spanning tree, counting, leader election, and related problems
- Predicting Structured Data (2007)
- Distributed detection and estimation in wireless sensor networks
- Distributed optimization and statistical learning via the alternating direction method of multipliers, Found. Trends Mach. Learn. (2011)
- Bagging predictors, Mach. Learn. (1996)
- Map-reduce for machine learning on multicore, Adv. Neural Inf. Process. Syst. (2007)
- Functional link adaptive filters for nonlinear acoustic echo cancellation, IEEE Trans. Audio Speech Lang. Process.
- Mapreduce: simplified data processing on large clusters, Commun. ACM
- Large scale distributed deep networks, Adv. Neural Inf. Process. Syst.
- Optimal distributed online prediction using mini-batches, J. Mach. Learn. Res.
- Consensus-based distributed support vector machines, J. Mach. Learn. Res.
- The Elements of Statistical Learning
- Stochastic choice of basis functions in adaptive function approximation and the functional-link net, IEEE Trans. Neural Networks
- Fast distributed gradient methods, IEEE Trans. Autom. Control
- The distributed boosting algorithm