A two-phase half-async method for heterogeneity-aware federated learning
Introduction
With the rise of the Internet of Things (IoT) and fifth-generation (5G) mobile networks, massive numbers of edge/mobile devices must process ever more data. This introduces additional requirements and challenges for future use cases: 1) communication latency will be in the 1-millisecond range, and 2) data rates per user device will be in the 1–10 gigabits-per-second range, which implies a 5–10× reduction in latency and a 10–100× increase in user data rate, respectively. As the data volume, data rates, and number of edge devices increase tremendously, the whole life-cycle of data mining becomes more challenging. Such scenarios require higher communication bandwidth and computing capability, which is costly in both time and energy. Thus, training strategies for deep learning models have shifted from centralized to distributed and decentralized paradigms in recent years, e.g., distributed gradient-descent-based methods and the prevalent parameter server [39], [6], [7], [16], [11], [21].
However, distributed learning approaches still suffer from communication overhead and privacy risks [24]. Federated Learning (FL) [9] was proposed to alleviate these challenges, and in the 5G era it has become a candidate paradigm for the next generation of machine learning on edge/mobile devices [22]. First, it avoids sharing sensitive data kept on local devices, thus protecting user privacy. Second, it deploys and trains models distributively without requiring dense computation on central servers. Resource-limited edge/mobile devices receive a randomly initialized model from the central server and, when idle, quickly train their local models using only their local data, which are relatively small (e.g., only dozens or hundreds of records per device). The central server then samples a subset of devices, collects encrypted model parameters (e.g., gradients, weights, and biases), and aggregates the collected models into a global model using a specific aggregation algorithm, e.g., FedAvg [24] or its variants [31], [8], [17], [13], [26], [27]. Finally, the aggregated global model is sent to each device to replace its latest local model. These collaborative processes are shown in Fig. 1.
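The aggregation step described above can be made concrete with a small sketch of FedAvg-style weighted averaging, where each sampled client's parameters are weighted by its local data size. This is an illustrative sketch of the standard FedAvg rule, not code from the paper:

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """Aggregate client model parameters by data-size-weighted averaging.

    client_weights: list of flattened parameter vectors, one per sampled client.
    client_sizes:   number of local training samples held by each client.
    """
    total = sum(client_sizes)
    stacked = np.stack(client_weights)        # shape: (num_clients, dim)
    coeffs = np.array(client_sizes) / total   # n_k / n weighting as in FedAvg
    return coeffs @ stacked                   # weighted average of parameters

# Three clients with different data volumes; the larger client dominates.
w_global = fedavg([np.array([1.0, 2.0]),
                   np.array([3.0, 4.0]),
                   np.array([5.0, 6.0])],
                  client_sizes=[10, 10, 20])
# → array([3.5, 4.5])
```

In a real system the parameter vectors would be the (encrypted) weights collected from the sampled subset of devices, and the result would be broadcast back to replace each device's local model.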
However, though FL was deliberately designed for statistical heterogeneity (e.g., non-Independently and Identically Distributed (non-IID) data) and environmental heterogeneity (e.g., staleness and unreliable connections), most existing methods still suffer performance degradation from both. First, data across devices usually differ and exhibit long-tail distributions. Such data heterogeneity also introduces bias into models; e.g., a model may be biased towards commonly occurring devices or devices with larger amounts of data [4], [18]. Solving statistical heterogeneity is not easy, since FL forbids data sharing between devices due to its fundamental privacy-preservation requirements, which makes estimating non-IIDness difficult. Second, environmental heterogeneity causes stagnancy on some devices, which hinders model convergence [5]. Specifically, local models trained on these slow devices may degrade the global model's performance, since their distributions may be inconsistent with the current global model. Moreover, this effect continuously accumulates, because the running environments of stale devices may remain unstable throughout the training life-cycle: the "slow" models have less chance of being aggregated into the global model, magnifying the differences. Furthermore, synchronous federated learning methods are costly in aggregation [5] and are potentially inefficient for large-scale deployments, since the server halts while collecting models, even from a much smaller subset of devices [34].
Scholars have proposed asynchronous federated learning (AFL) approaches that address these problems without waiting for slow devices [25], [20], [33]. However, asynchrony magnifies the staleness effect on non-IID data, because such data are usually long-tailed across distributed heterogeneous devices, where a "Matthew Effect" exists: staleness continuously accumulates on slow devices and eventually degrades overall performance.
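A common staleness-mitigation idea in AFL work is for the server to mix each arriving local model into the global model with a weight that decays with the update's staleness. The sketch below illustrates this generic pattern (in the style of prior AFL methods, not FedHA); the decay constants `alpha0` and `a` are hypothetical choices:

```python
def async_update(w_global, w_local, staleness, alpha0=0.6, a=0.5):
    """Mix a (possibly stale) local model into the global model.

    The mixing weight decays polynomially with staleness, so a local model
    computed against an old global model contributes less to the new one.
    Generic sketch of staleness damping, not the paper's method.
    """
    alpha = alpha0 * (1.0 + staleness) ** (-a)   # decayed mixing weight
    return [(1 - alpha) * g + alpha * l for g, l in zip(w_global, w_local)]

fresh = async_update([0.0], [1.0], staleness=0)   # alpha = 0.6 → [0.6]
stale = async_update([0.0], [1.0], staleness=3)   # alpha = 0.3 → [0.3]
```

Damping alone, however, does not remove the "Matthew Effect" described above: persistently slow devices still contribute less and less over time, which is the gap the two-phase design targets.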
This paper focuses on performance issues in FL caused by non-IID data (statistical heterogeneity) and slow devices (environmental heterogeneity). The main contributions of this paper are summarized as follows:
- •
We propose a half-async method named FedHA (Federated Heterogeneity Awareness). It trains models in two phases using different strategies, namely model selection, adaptive local epoch, and heterogeneity weighted aggregation. It simultaneously retains the efficiency of asynchronous methods and the communication efficiency of synchronous methods. To the best of our knowledge, FedHA is the first FL method to incorporate half-async model training from a two-phase perspective.
- •
We provide theoretical guarantees for FedHA on both convex and non-convex problems. To the best of our knowledge, our analysis provides the first complete convergence guarantees for async/half-async FL that require no additional assumptions. In phase I, FedHA improves on the best-known methods in both convergence rate and communication bound; in phase II, FedHA gives results comparable to the best-known methods.
- •
We verify the efficiency of FedHA and its strategies through empirical experiments with convex and non-convex models on multiple datasets. Results show that FedHA consistently outperforms state-of-the-art FL/AFL methods on accuracy, communication efficiency, and flexibility.
Section snippets
Federated learning on non-IID data
Though federated learning is considered promising for non-IID data [24], its performance is not stable. Zhao et al. [38] argued that FedAvg suffers significant performance degradation on highly skewed non-IID data. They further used the Earth Mover's Distance (EMD) to calculate the weight divergence between local models, reflecting overall data non-IIDness. However, their proposed method requires a small subset of data to be shared globally, which is not practical in a real-world FL
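The weight-divergence quantity that Zhao et al. relate to non-IIDness can be sketched as a relative L2 distance between a local model and a reference model; the exact normalization below is an assumption for illustration:

```python
import numpy as np

def weight_divergence(w_local, w_ref):
    """Relative L2 distance between a local model and a reference model.

    Larger values indicate that local training has drifted further from the
    reference, which Zhao et al. connect to the skew (EMD) of local data.
    """
    return np.linalg.norm(w_local - w_ref) / np.linalg.norm(w_ref)

w_ref = np.array([1.0, 0.0])
print(weight_divergence(np.array([1.0, 0.5]), w_ref))  # prints 0.5
```

Note that computing such a statistic needs only model parameters, not raw data, which is why divergence-based heuristics are attractive when data sharing is forbidden.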
Preliminaries
We first summarize the general form of FL. Suppose an FL system consists of M distributed edge/mobile devices (e.g., mobile phones) and a server node that aggregates collected models. The goal is to train a global model across these devices without uploading their private local data. At the t-th global communication round, an available device i (1 ≤ i ≤ M) receives the current global model from the server as its initial local model. Then, it uses an optimizer (e.g., stochastic
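A minimal sketch of one local round under this form, assuming SGD as the local optimizer and using a least-squares loss purely to stand in for the local objective F_i:

```python
import numpy as np

def local_sgd(w_global, data, labels, lr=0.1, epochs=5):
    """Run local SGD on one device, starting from the received global model.

    The squared loss (x·w − y)^2 is only an illustration of a local
    objective; any differentiable loss fits the same template.
    """
    w = w_global.copy()
    for _ in range(epochs):
        for x, y in zip(data, labels):
            grad = 2 * (x @ w - y) * x        # gradient of (x·w − y)^2
            w -= lr * grad
    return w

w0 = np.zeros(2)                              # global model from the server
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, -1.0])
w_i = local_sgd(w0, X, y)                     # device i's updated local model
```

With more local epochs, `w_i` approaches this device's least-squares solution [1, −1]; the tension in FL is that each device drifts towards its own optimum, which the server must reconcile at aggregation time.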
Proposed algorithms
In this section, we present FedHA and its specific strategies. We first introduce the design motivations of FedHA. Then we illustrate in detail the two-phase processes that contain an evaluation metric on training progress and distinct strategies, namely Consistency of Direction (COD), Model Selection (MS), Adaptive Local Epoch (AE), and the Heterogeneity Weight Aggregation (HWA).
Convergence analysis
We theoretically analyze the performance of FedHA on both convex and non-convex problems in this section. First, we make the following underlying assumptions, widely used in previous literature, e.g., [1], [30], [36], [19]. Assumption 1: The loss functions F and F_i for all i in {1, …, M} are L-smooth (L > 0), i.e., ‖∇F(x) − ∇F(y)‖ ≤ L‖x − y‖ for any x and y, and likewise for each F_i. Assumption 2: Gradients of the loss functions F and F_i are bounded, i.e., (a) the gradients of F_i are bounded by G, i.e.,
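The smoothness assumption can be sanity-checked numerically. For a quadratic loss F(x) = ½ xᵀAx with symmetric positive semi-definite A, the gradient is ∇F(x) = Ax and the smoothness constant is the largest eigenvalue of A, since ‖A(x − y)‖ ≤ λ_max(A)‖x − y‖. The sketch below (illustrative, not part of the paper's analysis) verifies the inequality on random points:

```python
import numpy as np

rng = np.random.default_rng(0)

# F(x) = 0.5 * x^T A x is L-smooth with L = largest eigenvalue of A.
A = np.array([[2.0, 0.0], [0.0, 1.0]])
L = max(np.linalg.eigvalsh(A))                # smoothness constant, here 2.0

grad = lambda x: A @ x                        # ∇F(x) = A x
for _ in range(1000):
    x, y = rng.standard_normal(2), rng.standard_normal(2)
    # L-smoothness: ||∇F(x) − ∇F(y)|| ≤ L ||x − y||
    assert np.linalg.norm(grad(x) - grad(y)) <= L * np.linalg.norm(x - y) + 1e-9
```

The bounded-gradient assumption (Assumption 2) does not hold globally for this quadratic example; in the convergence analysis it is imposed on the iterates visited during training.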
Experiments and analysis
We conduct empirical experiments in this section to answer the following research questions:
- •
RQ1: Can FedHA outperform the state-of-the-art methods on both IID and non-IID data (statistical heterogeneity)?
- •
RQ2: Can FedHA perform efficiently under the high impact of staleness (environmental heterogeneity)?
- •
RQ3: How does each strategy contribute to the performance of FedHA? How to choose appropriate strategies under different conditions?
- •
RQ4: How do the hyper-parameters influence the performance of
Conclusion and future work
This paper has proposed a novel federated learning approach that is efficient and stable under statistical and environmental heterogeneity. Using a two-phase training mechanism, it accelerates convergence on non-IID data while resisting the performance deterioration caused by the staleness effect. Theoretical analysis and experimental results show that our approach converges faster with fewer communication rounds than baselines and can resist the staleness effect without
CRediT authorship contribution statement
Tianyi Ma: Conceptualization, Methodology, Investigation, Project administration, Supervision, Writing – original draft, Writing – review & editing, Software, Validation, Resources. Bingcheng Mao: Methodology, Writing – original draft, Writing – review & editing, Software, Formal analysis, Validation, Visualization. Ming Chen: Conceptualization, Project administration, Supervision, Writing – original draft, Resources.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References (39)
- et al., Distributed machine learning in networks by consensus, Neurocomputing (2014)
- et al., Optimization methods for large-scale machine learning, SIAM Rev. (2018)
- C. Briggs, Z. Fan, P. Andras, Federated learning with hierarchical clustering of local updates to improve training on...
- S. Caldas, P. Wu, T. Li, J. Konečný, H.B. McMahan, V. Smith, A. Talwalkar, Leaf: a benchmark for federated settings,...
- Non-iidness learning in behavioral and social data, Comput. J. (2014)
- et al., Lag: lazily aggregated gradient for communication-efficient distributed learning, Adv. Neural Inf. Process. Syst. (2018)
- et al., Large scale distributed deep networks, Adv. Neural Inf. Process. Syst. (2012)
- M. Kamp, L. Adilova, J. Sicking, F. Hüger, P. Schlicht, T. Wirtz, S. Wrobel, Efficient decentralized deep learning by...
- J. Konečný, H.B. McMahan, D. Ramage, P. Richtárik, Federated optimization: Distributed machine learning for on-device...
- A. Krizhevsky, G. Hinton, et al., Learning multiple layers of features from tiny images,...
- Communication-efficient algorithms for decentralized and stochastic optimization, Math. Program.
- Federated learning for keyword spotting
- Rsa: Byzantine-robust stochastic aggregation methods for distributed learning from heterogeneous datasets
- Communication efficient distributed machine learning with the parameter server, Adv. Neural Inf. Process. Syst.
Cited by (3)
- A two-stage federated optimization algorithm for privacy computing in Internet of Things, 2023, Future Generation Computer Systems
- Adaptive Fairness Federal Learning, 2023, SSRN
- Research on Open Innovation Intelligent Decision-Making of Cross-Border E-Commerce Based on Federated Learning, 2022, Mathematical Problems in Engineering
Tianyi Ma received the Ph.D. degree in Computer Science and Technology with the Department of Electrical, Computer and Biomedical Engineering from University of Pavia, Italy, 2017. He is a postdoctoral researcher at the College of Computer Science and Technology, Zhejiang University, and works with Hithink RoyalFlush Information Network Co., Ltd. His current research interests include federated learning, recommender systems and explainable AI.
Bingcheng Mao received his B.S. and M.S. degrees in College of Computer Science and Technology from Nanjing University of Aeronautics and Astronautics, China, in 2016 and 2019, respectively. He is currently working with Hithink RoyalFlush Information Network Co., Ltd. His current research interests include federated learning, machine learning and computer vision.
Ming Chen received the Ph.D. degree in Computer Science and Technology from Zhejiang University. He is with the College of Computer Science and Technology, Zhejiang University, and serves as the Director of AI Research Institute, Hithink RoyalFlush Information Network Co., Ltd. His current research interests include machine learning, big data and computer vision.