Neurocomputing, Volume 485, 7 May 2022, Pages 134-154

A two-phase half-async method for heterogeneity-aware federated learning

https://doi.org/10.1016/j.neucom.2021.08.146

Abstract

Federated learning (FL) is a distributed machine learning paradigm that allows training models on decentralized data spread over large-scale edge/mobile devices without collecting raw data. However, existing methods are still far from efficient and stable under extreme statistical and environmental heterogeneity. In this work, we propose FedHA (Federated Heterogeneity Awareness), a novel half-async algorithm that simultaneously incorporates the merits of asynchronous and synchronous methods. It separates training into two phases by estimating the consistency of the optimization directions of the collected local models, and applies different strategies in these phases to facilitate fast and stable training, namely model selection, adaptive local epoch, and heterogeneity weighted aggregation. We provide theoretical convergence and communication guarantees on both convex and non-convex problems without introducing extra assumptions. In the first phase (the consistent phase), the convergence rate of FedHA is $O(1/e^T)$, which is faster than existing methods while reducing communication. In the second phase (the inconsistent phase), FedHA retains the best-known results in convergence ($O(1/T)$) and communication ($O(1)$). We validate the proposed algorithm on different tasks with both IID (independently and identically distributed) and non-IID data, and the results show that, with the proposed strategies, our algorithm is efficient, stable, and flexible under the twofold heterogeneity.

Introduction

With the rise of the Internet of Things (IoT) and the fifth-generation mobile network (5G), massive numbers of edge/mobile devices must process ever more data. This introduces additional requirements and challenges for future use cases: 1) communication latency will be in the 1-millisecond range, and 2) data rates per user device will be in the 1–10 gigabits-per-second range, which implies a 5–10× reduction in latency and a 10–100× increase in user data rate, respectively.1 As data volume, data rates, and the number of edge devices increase tremendously, the whole life-cycle of data mining becomes more challenging. Such scenarios require higher communication bandwidth and computing capability, which is costly in both time and energy. Thus, training strategies for deep learning models have shifted from centralized to distributed and decentralized paradigms in recent years, e.g., distributed gradient descent based methods and the prevalent parameter server [39], [6], [7], [16], [11], [21].

However, distributed learning approaches always suffer from communication overhead and privacy risks [24]. Federated Learning (FL) [9] was proposed to alleviate the challenges mentioned above, and in the 5G era it has become one of the candidate paradigms for the next generation of machine learning on edge/mobile devices [22]. First, it avoids sharing sensitive data kept on local devices, thus protecting user privacy. Second, it deploys and trains models distributively without the need for dense computation in central servers. Resource-limited edge/mobile devices receive a randomly initialized model from the central server and, when idle, quickly train local models using only their local data, which are relatively small in size (e.g., only dozens or hundreds of records per device). Then, the central server samples a subset of devices, collects their encrypted model parameters (e.g., gradients, weights, and biases), and aggregates the collected models into a global model using a specific aggregation algorithm, e.g., FedAvg [24] or its variants [31], [8], [17], [13], [26], [27]. Finally, the aggregated global model is sent to each device to replace its latest local model. These collaborative processes are shown in Fig. 1.
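To make the aggregation step concrete, the following is a minimal sketch of a FedAvg-style weighted average of collected local models; the function name, the flattened-parameter representation, and the NumPy setup are illustrative assumptions rather than the exact procedure of [24].

```python
import numpy as np

def fedavg_aggregate(local_weights, num_samples):
    """Weighted average of collected local models (FedAvg-style sketch).

    local_weights: list of 1-D NumPy arrays, one flattened parameter
                   vector per sampled device.
    num_samples:   list of local dataset sizes, used as weights.
    """
    total = float(sum(num_samples))
    global_weights = np.zeros_like(local_weights[0])
    for w_i, n_i in zip(local_weights, num_samples):
        global_weights += (n_i / total) * w_i
    return global_weights

# Example: three devices with different amounts of local data.
w = [np.array([0.2, 1.0]), np.array([0.4, 0.8]), np.array([0.1, 1.2])]
n = [50, 100, 25]
print(fedavg_aggregate(w, n))  # data-size-weighted global model
```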

However, though FL was deliberately designed for statistical heterogeneity (e.g., non-independently and identically distributed data) and environmental heterogeneity (e.g., staleness and unreliable connections), most existing methods still suffer from the performance degradation these cause. First, data on different devices are usually different and exhibit long-tail distributions. Such heterogeneity in data also introduces bias into models; e.g., the model may be biased towards commonly occurring devices or devices with a larger amount of data [4], [18]. Solving statistical heterogeneity issues is not easy, since FL does not allow data sharing between devices due to its fundamental requirement of privacy preservation, which makes estimating non-IIDness difficult. Second, environmental heterogeneity causes stagnancy for some devices, which hinders the convergence of models [5]. Specifically, local models trained on these slow devices may degrade the global model’s performance, since their distributions may be inconsistent with the current global model. Moreover, this effect continuously accumulates on these devices, since the running environment of stale devices may remain unstable during the whole life-cycle of training. The “slow” models have less chance of being aggregated into the global model, which magnifies the degree of difference. Furthermore, synchronous federated learning methods are costly in aggregation [5] and potentially inefficient for large-scale deployments, since the server halts when collecting models, even from a much smaller subset of devices [34].

Scholars have proposed asynchronous federated learning (AFL) approaches that address these problems without waiting for slow devices [25], [20], [33]. However, asynchrony adversely magnifies the staleness effect on non-IID data, because such data are usually long-tailed across distributed heterogeneous devices, where a “Matthew effect” exists: staleness continuously accumulates on slow devices. Eventually, it degrades the overall performance.

This paper focuses on performance issues in FL caused by non-IID data (statistical heterogeneity) and slow devices (environmental heterogeneity). The main contributions of this paper are summarized as follows:

  • We propose a half-async method named FedHA (Federated Heterogeneity Awareness). It trains models in two phases using different strategies, namely model selection, adaptive local epoch, and heterogeneity weighted aggregation, and simultaneously retains the efficiency of asynchronous methods and the communication efficiency of synchronous methods. To the best of our knowledge, FedHA is the first FL method to incorporate half-async model training from a two-phase perspective.

  • We provide theoretical guarantees for FedHA on both convex and non-convex problems. To the best of our knowledge, our analysis provides the first complete convergence guarantees for async/half-async FL that do not require additional assumptions. In comparison, FedHA reaches an $O(1/e^T)$ convergence rate and an $O(1)$ communication bound in phase I, superior to the best-known methods. In phase II, FedHA gives results comparable to the best-known methods.

  • We verify the efficiency of FedHA and its strategies through empirical experiments with convex and non-convex models on multiple datasets. The results show that FedHA consistently outperforms state-of-the-art FL/AFL methods in accuracy, communication efficiency, and flexibility.


Federated learning on non-IID data

Though federated learning is considered promising for non-IID data [24], its performance is not stable. Zhao et al. [38] argued that FedAvg suffers from significant performance degradation with highly skewed non-IID data. They further used the Earth Mover’s Distance (EMD) to calculate the weight divergence between local models, reflecting the overall non-IIDness of the data. However, a small subset of data has to be shared globally in their proposed method, which is not practical in a real-world FL setting.

Preliminaries

We first summarize the general form of FL. Suppose an FL system consists of $M$ distributed edge/mobile devices (e.g., mobile phones) and a server node that aggregates the collected models. The goal is to train a global model across these devices without uploading their private local data. At the $t$-th global communication round, an available device $i$ ($i \in \mathcal{M}$, where $\mathcal{M}=\{1,2,3,\dots,M\}$) receives the current global model $\omega^t$ from the server as its initial local model $\omega_i^t$. Then, it uses an optimizer (e.g., stochastic gradient descent, SGD) to update the local model on its own data.
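As an illustration of the device-side step just described, the sketch below runs plain SGD from the received global model on the local data only; the least-squares loss, learning rate, and epoch count are assumptions chosen for brevity, not the paper's actual local objective.

```python
import numpy as np

def local_sgd(global_w, data_x, data_y, lr=0.01, epochs=1):
    """Device-side update: start from the received global model omega^t
    and run plain SGD on the local data (least-squares loss for illustration).

    global_w: flattened global parameter vector received from the server.
    data_x:   list of local feature vectors; data_y: list of local targets.
    Returns the updated local model to be sent back for aggregation.
    """
    w = global_w.copy()
    for _ in range(epochs):
        for x, y in zip(data_x, data_y):
            grad = 2.0 * (x @ w - y) * x   # gradient of (x^T w - y)^2
            w -= lr * grad
    return w
```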

Proposed algorithms

In this section, we present FedHA and its specific strategies. We first introduce the design motivations of FedHA. Then we illustrate in detail the two-phase processes that contain an evaluation metric on training progress and distinct strategies, namely Consistency of Direction (COD), Model Selection (MS), Adaptive Local Epoch (AE), and the Heterogeneity Weight Aggregation (HWA).
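As a rough illustration of how a two-phase server loop might switch strategies based on an estimate of the consistency of optimization directions, consider the sketch below; the cosine-similarity proxy for COD, the threshold, and the down-weighting rule standing in for MS/HWA are assumptions made for exposition, not FedHA's actual definitions.

```python
import numpy as np

def consistency_of_direction(updates):
    """Illustrative COD proxy: mean pairwise cosine similarity of the
    collected local update directions (an assumption, not the paper's metric)."""
    dirs = [u / (np.linalg.norm(u) + 1e-12) for u in updates]
    sims = [dirs[i] @ dirs[j]
            for i in range(len(dirs)) for j in range(i + 1, len(dirs))]
    return float(np.mean(sims)) if sims else 1.0

def server_round(global_w, local_ws, threshold=0.5):
    """One hypothetical server round that changes aggregation behaviour
    depending on whether local updates still point in a consistent direction."""
    updates = [w_i - global_w for w_i in local_ws]
    if consistency_of_direction(updates) >= threshold:
        # Phase I (consistent): aggregate all updates directly.
        agg = np.mean(updates, axis=0)
    else:
        # Phase II (inconsistent): down-weight large, divergent updates
        # (a stand-in for model selection / heterogeneity-weighted aggregation).
        norms = np.array([np.linalg.norm(u) for u in updates])
        weights = 1.0 / (1.0 + norms)
        weights /= weights.sum()
        agg = sum(w * u for w, u in zip(weights, updates))
    return global_w + agg
```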

Convergence analysis

We theoretically analyze the performance of FedHA on both convex and non-convex problems in this section. First, we make the following underlying assumptions widely used in previous literature, e.g., [1], [30], [36], [19].

Assumption 1

The loss functions $F(\omega)$ and $F_i(\omega)$ for all $i \in \mathcal{M}$ are $\beta$-smooth ($\beta>0$), i.e., for any $\omega$ and $\omega'$, $\|\nabla F(\omega)-\nabla F(\omega')\| \le \beta\|\omega-\omega'\|$ and $\|\nabla F_i(\omega)-\nabla F_i(\omega')\| \le \beta\|\omega-\omega'\|$.
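For reference, a standard consequence of Assumption 1 that smoothness-based analyses typically invoke is the descent lemma (stated here for completeness, not as a result specific to this paper):

```latex
% Descent lemma implied by \beta-smoothness: for all \omega, \omega',
F(\omega') \le F(\omega) + \langle \nabla F(\omega),\, \omega' - \omega \rangle
            + \tfrac{\beta}{2}\,\lVert \omega' - \omega \rVert^2 .
```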

Assumption 2

Gradients of the loss functions $F(\omega)$ and $F_i(\omega)$ are bounded, i.e., (a) the gradients of $F(\omega)$ are bounded by $G$ ($G>0$) for all $\omega$, i.e., $\|\nabla F(\omega)\| \le G$

Experiments and analysis

We conduct empirical experiments in this section to answer the following research questions:

  • RQ1: Can FedHA outperform the state-of-the-art methods on both IID and non-IID data (statistical heterogeneity)?

  • RQ2: Can FedHA perform efficiently under the high impact of staleness (environmental heterogeneity)?

  • RQ3: How does each strategy contribute to the performance of FedHA? How to choose appropriate strategies under different conditions?

  • RQ4: How do the hyper-parameters influence the performance of FedHA?

Conclusion and future work

This paper has proposed a novel federated learning approach that is efficient and stable under statistical and environmental heterogeneity. Using a two-phase training mechanism, it dynamically accelerates convergence on non-IID data while resisting the performance deterioration caused by the staleness effect. Theoretical analysis and experimental results show that our approach converges faster with fewer communication rounds than the baselines and can resist the staleness effect without

CRediT authorship contribution statement

Tianyi Ma: Conceptualization, Methodology, Investigation, Project administration, Supervision, Writing – original draft, Writing – review & editing, Software, Validation, Resources. Bingcheng Mao: Methodology, Writing – original draft, Writing – review & editing, Software, Formal analysis, Validation, Visualization. Ming Chen: Conceptualization, Project administration, Supervision, Writing – original draft, Resources.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.


References (39)

  • L. Georgopoulos et al., Distributed machine learning in networks by consensus, Neurocomputing (2014).
  • L. Bottou et al., Optimization methods for large-scale machine learning, SIAM Rev. (2018).
  • C. Briggs, Z. Fan, P. Andras, Federated learning with hierarchical clustering of local updates to improve training on...
  • S. Caldas, P. Wu, T. Li, J. Konečný, H.B. McMahan, V. Smith, A. Talwalkar, LEAF: a benchmark for federated settings,...
  • L. Cao, Non-IIDness learning in behavioral and social data, Comput. J. (2014).
  • T. Chen et al., LAG: lazily aggregated gradient for communication-efficient distributed learning, Adv. Neural Inf. Process. Syst. (2018).
  • J. Dean et al., Large scale distributed deep networks, Adv. Neural Inf. Process. Syst. (2012).
  • M. Kamp, L. Adilova, J. Sicking, F. Hüger, P. Schlicht, T. Wirtz, S. Wrobel, Efficient decentralized deep learning by...
  • J. Konečný, H.B. McMahan, D. Ramage, P. Richtárik, Federated optimization: Distributed machine learning for on-device...
  • A. Krizhevsky, G. Hinton, et al., Learning multiple layers of features from tiny images,...
  • G. Lan et al., Communication-efficient algorithms for decentralized and stochastic optimization, Math. Program. (2017).
  • Y. LeCun, The MNIST database of handwritten digits, 1998. http://yann. lecun....
  • D. Leroy et al., Federated learning for keyword spotting.
  • H. Li, Z. Xu, G. Taylor, C. Studer, T. Goldstein, Visualizing the loss landscape of neural nets, in: Neural Information...
  • L. Li et al., RSA: Byzantine-robust stochastic aggregation methods for distributed learning from heterogeneous datasets.
  • M. Li et al., Communication efficient distributed machine learning with the parameter server, Adv. Neural Inf. Process. Syst. (2014).
  • T. Li, A.K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, V. Smith, Federated optimization in heterogeneous networks,...
  • T. Li, M. Sanjabi, A. Beirami, V. Smith, Fair resource allocation in federated learning, 2019b. arXiv preprint...
  • X. Li, K. Huang, W. Yang, S. Wang, Z. Zhang, On the convergence of FedAvg on non-IID data, 2019c. arXiv preprint...

Tianyi Ma received the Ph.D. degree in Computer Science and Technology with the Department of Electrical, Computer and Biomedical Engineering from University of Pavia, Italy, 2017. He is a postdoctoral researcher at the College of Computer Science and Technology, Zhejiang University, and works with Hithink RoyalFlush Information Network Co., Ltd. His current research interests include federated learning, recommender systems and explainable AI.

Bingcheng Mao received his B.S. and M.S. degrees in College of Computer Science and Technology from Nanjing University of Aeronautics and Astronautics, China, in 2016 and 2019, respectively. He is currently working with Hithink RoyalFlush Information Network Co., Ltd. His current research interests include federated learning, machine learning and computer vision.

Ming Chen received the Ph.D. degree in Computer Science and Technology from Zhejiang University. He is with the College of Computer Science and Technology, Zhejiang University, and serves as the Director of AI Research Institute, Hithink RoyalFlush Information Network Co., Ltd. His current research interests include machine learning, big data and computer vision.
