Neurocomputing, Volume 485, 7 May 2022, Pages 134-154

A two-phase half-async method for heterogeneity-aware federated learning

https://doi.org/10.1016/j.neucom.2021.08.146

Abstract

Federated learning (FL) is a distributed machine learning paradigm that allows training models on decentralized data spread over large-scale edge/mobile devices without collecting raw data. However, existing methods are still far from efficient and stable under extreme statistical and environmental heterogeneity. In this work, we propose FedHA (Federated Heterogeneity Awareness), a novel half-async algorithm that simultaneously incorporates the merits of asynchronous and synchronous methods. It separates training into two phases by estimating the consistency of the optimization directions of the collected local models, and applies different strategies in these phases to facilitate fast and stable training, namely model selection, adaptive local epoch, and heterogeneity weighted aggregation. We provide theoretical convergence and communication guarantees on both convex and non-convex problems without introducing extra assumptions. In the first phase (the consistent phase), the convergence rate of FedHA is $O(1/e^T)$, which is faster than existing methods while reducing communication. In the second phase (the inconsistent phase), FedHA retains the best-known results in convergence ($O(1/T)$) and communication ($O(1)$). We validate the proposed algorithm on different tasks with both IID (independently and identically distributed) and non-IID data, and the results show that, with the proposed strategies, our algorithm is efficient, stable, and flexible under the twofold heterogeneity.

Introduction

With the rise of the Internet of Things (IoT) and the fifth-generation mobile network (5G), massive numbers of edge/mobile devices must process ever more data. This introduces additional requirements and challenges for future use cases: 1) communication latency will be in the 1-millisecond range, and 2) data rates per user device will be in the 1–10 gigabits-per-second range, which implies a 5–10× reduction in latency and a 10–100× increase in user data rate, respectively.1 As data volume, data rates, and the number of edge devices increase tremendously, the whole life-cycle of data mining becomes more challenging. Such scenarios require higher communication bandwidth and computing capability, which is costly in both time and energy. Thus, training strategies for deep learning models have shifted from centralized to distributed and decentralized paradigms in recent years, e.g., distributed gradient descent based methods and the prevalent parameter server [39], [6], [7], [16], [11], [21].

However, distributed learning approaches always suffer from communication overhead and privacy risks [24]. Federated Learning (FL) [9] was proposed to alleviate the challenges mentioned above, and in the 5G era it has become one of the candidate paradigms for the next generation of machine learning on edge/mobile devices [22]. First, it avoids sharing sensitive data kept on local devices, thus protecting user privacy. Second, it deploys and trains models distributively without the need for dense computation in central servers. Resource-limited edge/mobile devices receive a randomly initialized model from the central server and, when idle, quickly train local models using only their local data, which are relatively small in size (e.g., only dozens or hundreds of records per device). Then, the central server samples a subset of devices, collects their encrypted model parameters (e.g., gradients, weights, and biases), and aggregates the collected models into a global model using a specific aggregation algorithm, e.g., FedAvg [24] or its variants [31], [8], [17], [13], [26], [27]. Finally, the aggregated global model is sent to each device to replace its latest local model. These collaborative processes are shown in Fig. 1.
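To make the aggregation step concrete, the following is a minimal sketch of a FedAvg-style weighted average of collected local models; the function name, the flattened-parameter representation, and the NumPy setup are illustrative assumptions rather than the exact procedure of [24].

```python
import numpy as np

def fedavg_aggregate(local_weights, num_samples):
    """Weighted average of collected local models (FedAvg-style sketch).

    local_weights: list of 1-D NumPy arrays, one flattened parameter
                   vector per sampled device.
    num_samples:   list of local dataset sizes, used as weights.
    """
    total = float(sum(num_samples))
    global_weights = np.zeros_like(local_weights[0])
    for w_i, n_i in zip(local_weights, num_samples):
        global_weights += (n_i / total) * w_i
    return global_weights

# Example: three devices with different amounts of local data.
w = [np.array([0.2, 1.0]), np.array([0.4, 0.8]), np.array([0.1, 1.2])]
n = [50, 100, 25]
print(fedavg_aggregate(w, n))  # data-size-weighted global model
```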

However, though FL was deliberately designed for statistical heterogeneity (e.g., non-independently and identically distributed data) and environmental heterogeneity (e.g., staleness and unreliable connections), most existing methods still suffer from the performance degradation these cause. First, data on different devices are usually different and exhibit long-tail distributions. Such heterogeneity in data also introduces bias into models; e.g., the model may be biased towards commonly occurring devices or devices with a larger amount of data [4], [18]. Solving statistical heterogeneity issues is not easy, since FL does not allow data sharing between devices due to its fundamental requirement of privacy preservation, which makes estimating non-IIDness difficult. Second, environmental heterogeneity causes stagnancy for some devices, which hinders the convergence of models [5]. Specifically, local models trained on these slow devices may degrade the global model’s performance, since their distributions may be inconsistent with the current global model. Moreover, this effect continuously accumulates on these devices, since the running environment of stale devices may remain unstable during the whole life-cycle of training. The “slow” models have less chance of being aggregated into the global model, which magnifies the degree of difference. Furthermore, synchronous federated learning methods are costly in aggregation [5] and potentially inefficient for large-scale deployments, since the server halts when collecting models, even from a much smaller subset of devices [34].

Scholars have proposed asynchronous federated learning (AFL) approaches that address these problems without waiting for slow devices [25], [20], [33]. However, asynchrony adversely magnifies the staleness effect on non-IID data, because such data are usually long-tailed across distributed heterogeneous devices, where a “Matthew effect” exists: staleness continuously accumulates on slow devices. Eventually, it degrades the overall performance.

This paper focuses on performance issues in FL caused by non-IID data (statistical heterogeneity) and slow devices (environmental heterogeneity). The main contributions of this paper are summarized as follows:

  • We propose a half-async method named FedHA (Federated Heterogeneity Awareness). It trains models in two phases using different strategies, namely model selection, adaptive local epoch, and heterogeneity weighted aggregation, and simultaneously retains the efficiency of asynchronous methods and the communication efficiency of synchronous methods. To the best of our knowledge, FedHA is the first FL method to incorporate half-async model training from a two-phase perspective.

  • We provide theoretical guarantees for FedHA on both convex and non-convex problems. To the best of our knowledge, our analysis provides the first complete convergence guarantees for async/half-async FL that do not require additional assumptions. In comparison, FedHA reaches an $O(1/e^T)$ convergence rate and an $O(1)$ communication bound in phase I, superior to the best-known methods. In phase II, FedHA gives results comparable to the best-known methods.

  • We verify the efficiency of FedHA and its strategies through empirical experiments with convex and non-convex models on multiple datasets. The results show that FedHA consistently outperforms state-of-the-art FL/AFL methods in accuracy, communication efficiency, and flexibility.


Federated learning on non-IID data

Though federated learning is considered promising for non-IID data [24], its performance is not stable. Zhao et al. [38] argued that FedAvg suffers from significant performance degradation with highly skewed non-IID data. They further used the Earth Mover’s Distance (EMD) to calculate the weight divergence between local models, reflecting the overall non-IIDness of the data. However, a small subset of data has to be shared globally in their proposed method, which is not practical in a real-world FL setting.

Preliminaries

We first summarize the general form of FL. Suppose an FL system consists of $M$ distributed edge/mobile devices (e.g., mobile phones) and a server node that aggregates the collected models. The goal is to train a global model across these devices without uploading their private local data. At the $t$-th global communication round, an available device $i$ ($i \in \mathcal{M}$, where $\mathcal{M}=\{1,2,3,\dots,M\}$) receives the current global model $\omega^t$ from the server as its initial local model $\omega_i^t$. Then, it uses an optimizer (e.g., stochastic gradient descent, SGD) to update the local model on its own data.
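As an illustration of the device-side step just described, the sketch below runs plain SGD from the received global model on the local data only; the least-squares loss, learning rate, and epoch count are assumptions chosen for brevity, not the paper's actual local objective.

```python
import numpy as np

def local_sgd(global_w, data_x, data_y, lr=0.01, epochs=1):
    """Device-side update: start from the received global model omega^t
    and run plain SGD on the local data (least-squares loss for illustration).

    global_w: flattened global parameter vector received from the server.
    data_x:   list of local feature vectors; data_y: list of local targets.
    Returns the updated local model to be sent back for aggregation.
    """
    w = global_w.copy()
    for _ in range(epochs):
        for x, y in zip(data_x, data_y):
            grad = 2.0 * (x @ w - y) * x   # gradient of (x^T w - y)^2
            w -= lr * grad
    return w
```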

Proposed algorithms

In this section, we present FedHA and its specific strategies. We first introduce the design motivations of FedHA. Then we illustrate in detail the two-phase processes that contain an evaluation metric on training progress and distinct strategies, namely Consistency of Direction (COD), Model Selection (MS), Adaptive Local Epoch (AE), and the Heterogeneity Weight Aggregation (HWA).
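As a rough illustration of how a two-phase server loop might switch strategies based on an estimate of the consistency of optimization directions, consider the sketch below; the cosine-similarity proxy for COD, the threshold, and the down-weighting rule standing in for MS/HWA are assumptions made for exposition, not FedHA's actual definitions.

```python
import numpy as np

def consistency_of_direction(updates):
    """Illustrative COD proxy: mean pairwise cosine similarity of the
    collected local update directions (an assumption, not the paper's metric)."""
    dirs = [u / (np.linalg.norm(u) + 1e-12) for u in updates]
    sims = [dirs[i] @ dirs[j]
            for i in range(len(dirs)) for j in range(i + 1, len(dirs))]
    return float(np.mean(sims)) if sims else 1.0

def server_round(global_w, local_ws, threshold=0.5):
    """One hypothetical server round that changes aggregation behaviour
    depending on whether local updates still point in a consistent direction."""
    updates = [w_i - global_w for w_i in local_ws]
    if consistency_of_direction(updates) >= threshold:
        # Phase I (consistent): aggregate all updates directly.
        agg = np.mean(updates, axis=0)
    else:
        # Phase II (inconsistent): down-weight large, divergent updates
        # (a stand-in for model selection / heterogeneity-weighted aggregation).
        norms = np.array([np.linalg.norm(u) for u in updates])
        weights = 1.0 / (1.0 + norms)
        weights /= weights.sum()
        agg = sum(w * u for w, u in zip(weights, updates))
    return global_w + agg
```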

Convergence analysis

We theoretically analyze the performance of FedHA on both convex and non-convex problems in this section. First, we make the following underlying assumptions widely used in previous literature, e.g., [1], [30], [36], [19].

Assumption 1

The loss functions $F(\omega)$ and $F_i(\omega)$ for all $i \in \mathcal{M}$ are $\beta$-smooth ($\beta>0$), i.e., for any $\omega$ and $\omega'$, $\|\nabla F(\omega)-\nabla F(\omega')\| \le \beta\|\omega-\omega'\|$ and $\|\nabla F_i(\omega)-\nabla F_i(\omega')\| \le \beta\|\omega-\omega'\|$.
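For reference, a standard consequence of Assumption 1 that smoothness-based analyses typically invoke is the descent lemma (stated here for completeness, not as a result specific to this paper):

```latex
% Descent lemma implied by \beta-smoothness: for all \omega, \omega',
F(\omega') \le F(\omega) + \langle \nabla F(\omega),\, \omega' - \omega \rangle
            + \tfrac{\beta}{2}\,\lVert \omega' - \omega \rVert^2 .
```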

Assumption 2

Gradients of the loss functions $F(\omega)$ and $F_i(\omega)$ are bounded, i.e., (a) the gradients of $F(\omega)$ are bounded by $G$ ($G>0$) for all $\omega$, i.e., $\|\nabla F(\omega)\| \le G$

Experiments and analysis

We conduct empirical experiments in this section to answer the following research questions:

  • RQ1: Can FedHA outperform the state-of-the-art methods on both IID and non-IID data (statistical heterogeneity)?

  • RQ2: Can FedHA perform efficiently under the high impact of staleness (environmental heterogeneity)?

  • RQ3: How does each strategy contribute to the performance of FedHA? How to choose appropriate strategies under different conditions?

  • RQ4: How do the hyper-parameters influence the performance of FedHA?

Conclusion and future work

This paper has proposed a novel federated learning approach that is efficient and stable under statistical and environmental heterogeneity. Using a two-phase training mechanism, it dynamically accelerates convergence on non-IID data while resisting the performance deterioration caused by the staleness effect. Theoretical analysis and experimental results show that our approach converges faster with fewer communication rounds than the baselines and can resist the staleness effect without

CRediT authorship contribution statement

Tianyi Ma: Conceptualization, Methodology, Investigation, Project administration, Supervision, Writing – original draft, Writing – review & editing, Software, Validation, Resources. Bingcheng Mao: Methodology, Writing – original draft, Writing – review & editing, Software, Formal analysis, Validation, Visualization. Ming Chen: Conceptualization, Project administration, Supervision, Writing – original draft, Resources.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.


References (39)

  • L. Georgopoulos et al., Distributed machine learning in networks by consensus, Neurocomputing (2014).
  • L. Bottou et al., Optimization methods for large-scale machine learning, SIAM Rev. (2018).
  • C. Briggs, Z. Fan, P. Andras, Federated learning with hierarchical clustering of local updates to improve training on...
  • S. Caldas, P. Wu, T. Li, J. Konečný, H.B. McMahan, V. Smith, A. Talwalkar, LEAF: a benchmark for federated settings,...
  • L. Cao, Non-IIDness learning in behavioral and social data, Comput. J. (2014).
  • T. Chen et al., LAG: lazily aggregated gradient for communication-efficient distributed learning, Adv. Neural Inf. Process. Syst. (2018).
  • J. Dean et al., Large scale distributed deep networks, Adv. Neural Inf. Process. Syst. (2012).
  • M. Kamp, L. Adilova, J. Sicking, F. Hüger, P. Schlicht, T. Wirtz, S. Wrobel, Efficient decentralized deep learning by...
  • J. Konečný, H.B. McMahan, D. Ramage, P. Richtárik, Federated optimization: Distributed machine learning for on-device...
  • A. Krizhevsky, G. Hinton, et al., Learning multiple layers of features from tiny images,...
  • G. Lan et al., Communication-efficient algorithms for decentralized and stochastic optimization, Math. Program. (2017).
  • Y. LeCun, The MNIST database of handwritten digits, 1998. http://yann. lecun....
  • D. Leroy et al., Federated learning for keyword spotting.
  • H. Li, Z. Xu, G. Taylor, C. Studer, T. Goldstein, Visualizing the loss landscape of neural nets, in: Neural Information...
  • L. Li et al., RSA: Byzantine-robust stochastic aggregation methods for distributed learning from heterogeneous datasets.
  • M. Li et al., Communication efficient distributed machine learning with the parameter server, Adv. Neural Inf. Process. Syst. (2014).
  • T. Li, A.K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, V. Smith, Federated optimization in heterogeneous networks,...
  • T. Li, M. Sanjabi, A. Beirami, V. Smith, Fair resource allocation in federated learning, 2019b. arXiv preprint...
  • X. Li, K. Huang, W. Yang, S. Wang, Z. Zhang, On the convergence of FedAvg on non-IID data, 2019c. arXiv preprint...

Tianyi Ma received the Ph.D. degree in Computer Science and Technology with the Department of Electrical, Computer and Biomedical Engineering from University of Pavia, Italy, 2017. He is a postdoctoral researcher at the College of Computer Science and Technology, Zhejiang University, and works with Hithink RoyalFlush Information Network Co., Ltd. His current research interests include federated learning, recommender systems and explainable AI.

Bingcheng Mao received his B.S. and M.S. degrees in College of Computer Science and Technology from Nanjing University of Aeronautics and Astronautics, China, in 2016 and 2019, respectively. He is currently working with Hithink RoyalFlush Information Network Co., Ltd. His current research interests include federated learning, machine learning and computer vision.

Ming Chen received the Ph.D. degree in Computer Science and Technology from Zhejiang University. He is with the College of Computer Science and Technology, Zhejiang University, and serves as the Director of AI Research Institute, Hithink RoyalFlush Information Network Co., Ltd. His current research interests include machine learning, big data and computer vision.
