FedSA: A staleness-aware asynchronous Federated Learning algorithm with non-IID data

https://doi.org/10.1016/j.future.2021.02.012

Highlights

  • Delayed devices with non-IID data degrade performance in federated learning systems.

  • Present FedSA, a staleness-aware and asynchronous method for federated learning.

  • Present new analytical results for a unified form of federated learning.

  • Present a novel two-stage strategy, derived from the analytical results, to accelerate training.

  • Present optimal strategies for hyper-parameters and speed-communication trade-offs.

Abstract

This paper presents new asynchronous methods for Federated Learning (FL), one of the next-generation paradigms for Artificial Intelligence (AI) systems. We consider the two-fold challenges that lie ahead. First, non-IID (non-Independent and Identically Distributed) data across devices cause unstable performance. Second, unreliable and slow environments not only slow convergence but also cause staleness issues. To address these challenges, this study uses a bottom-up approach for analysis and algorithm design. We first reformulate FL by unifying synchronous and asynchronous updating schemes with an asynchrony-related parameter. We theoretically analyze this new form and derive practical strategies for optimization. The key findings include: 1) a two-stage training strategy to accelerate training and reduce communication overhead; 2) strategies for choosing the key hyperparameters optimally in these stages to maintain efficiency and robustness. With these theoretical guarantees, we propose FedSA (Federated Staleness-Aware), a novel asynchronous federated learning algorithm. We validate FedSA on different tasks under non-IID/IID and staleness settings. Our results indicate that, given a large proportion of stale devices, the proposed algorithm achieves state-of-the-art performance, outperforming existing methods in both the non-IID and IID cases.

Introduction

Federated Learning (FL) is a recently emerged privacy-preserving paradigm for next-generation Artificial Intelligence (AI) that trains a global model across large-scale devices without disclosing local data. The process can be summarized as follows: devices first download a global model (or its initialization) from a server as their local models and train them on their local data. These newly trained models are then uploaded to the server. Finally, the server applies weighted averaging (e.g., FedAvg [1]) to aggregate the uploaded models into the global model, and broadcasts the new model to the devices.
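As a concrete illustration of the aggregation step described above, the following is a minimal sketch of FedAvg-style weighted averaging. The function and variable names are ours, not the paper's, and the sketch assumes each device's parameters arrive as a NumPy array.

```python
import numpy as np

def fedavg_aggregate(local_models, local_sizes):
    """Weighted-average local model parameters by local dataset size (FedAvg-style sketch).

    local_models: list of parameter arrays, one per participating device
    local_sizes:  list of local dataset sizes m_i
    """
    total = float(sum(local_sizes))
    global_model = np.zeros_like(local_models[0], dtype=float)
    for params, m_i in zip(local_models, local_sizes):
        # each device contributes proportionally to its data size
        global_model += (m_i / total) * params
    return global_model
```

In a synchronous round, the server would call such a routine only after all (or all sampled) devices have uploaded; the asynchronous schemes discussed below relax exactly this requirement.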

Unfortunately, substantial challenges lie ahead in implementing efficient and robust FL. First, statistical heterogeneity arises from non-IID (non-Independent and Identically Distributed) data, where data are highly skewed, extremely imbalanced, and vary across devices [1], [2]. This heterogeneity may cause distribution shifts and therefore needs to be evaluated; yet, for data-protection reasons, sharing data between devices is impractical in FL, which makes the heterogeneity hard to perceive. Second, unreliable connections and limited computing resources in heterogeneous environments cause environmental heterogeneity, where some devices are inactive or slow. The existence of such devices has been shown to degrade the performance of the global model [2], [3]. Furthermore, staleness accumulates because slow devices may remain unresponsive during training, so their local models have a lower chance of being aggregated. We call this the staleness effect; it magnifies the degree of statistical heterogeneity because the discrepancies between stale models and the global model may keep increasing. Finally, synchronous methods are too expensive, especially with large-scale devices, as the server has to wait until all or the sampled devices are ready [4].

To date, several studies have investigated Asynchronous Federated Learning (AFL), which allows model aggregation without waiting for stale devices [5], [6]. However, introducing asynchrony aggravates the staleness effect. The distribution of real-world data across devices is usually long-tailed, where “the rich get richer”. For example, data may grow more rapidly on some devices, requiring more processing and training time while computing capabilities remain unchanged. Eventually, this accumulated delay affects the convergence of the global model. Furthermore, communication becomes a bottleneck, since asynchrony requires more frequent communication between devices and the server [1], [7]. AFL also brings new challenges in the dynamics of hyperparameters and communication-accuracy trade-offs. Finally, theoretical convergence guarantees for AFL with non-IID data, and practical guidelines for designing efficient algorithms in this setting, are still unknown.

Contributions. This paper presents a theoretical analysis and a novel AFL algorithm to alleviate the issues mentioned above. Specifically, our main contributions are summarized as follows:

  • We extend the form of FL by introducing an asynchronous scheme with an asynchrony-related parameter I and provide a theoretical analysis of this new form. Concretely, we propose new assumptions for non-IIDness, which enable us to measure and bound the impact of non-IIDness using model discrepancies. We analyze and derive practical strategies for handling non-IIDness and staleness. The key results include: 1) a two-stage training strategy to accelerate training; 2) optimal strategies for choosing the key hyperparameters. To the best of our knowledge, this is the first analysis of AFL in the context of the two-fold heterogeneity with non-IID data, and the first FL approach that identifies and applies such a two-stage training strategy. Our analysis provides not only performance guarantees but also strategies for designing efficient and robust AFL algorithms.

  • Following the guidelines and implications from the above analysis, we propose FedSA, a novel asynchronous FL algorithm. It uses the two-stage strategy and dynamically chooses proper hyperparameters by simultaneously considering the two-fold heterogeneity in devices and the similarities among local models.

  • We conduct extensive experiments to validate the efficiency and robustness of our algorithm. The results on both non-IID and IID data exhibit a state-of-the-art convergence rate (i.e., O(1/T)) and superior robustness to stale devices.

We briefly summarize the subsequent sections. Section 2 reviews literature and approaches related to this research. Section 3 introduces the methodology used in our study, detailing the general form of FL covering both synchronous and asynchronous updating. In Section 4, we theoretically analyze the performance guarantees. Following those guarantees, FedSA and its strategies are presented in Section 5. We report extensive experiments in Section 6. Section 7 concludes with a summary of our findings and potential future work.


Related work

Our work is related to AFL algorithms and to staleness-resilience approaches in asynchronous stochastic gradient descent (Async-SGD). Async-SGD can accelerate training with near-linear speedup, but the staleness effect is known to be its primary problem [8]. Specifically, such methods suffer from 1) staleness introduced by delayed gradients, and 2) higher complexity in model dynamics, such as hyperparameter tuning and speed-accuracy trade-offs [9]. Staleness effect on the convergence…
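To make the staleness problem concrete, the sketch below is our own illustration, not a specific cited method: a delayed gradient is applied to parameters that have already moved on, and one common heuristic from the Async-SGD literature is to damp the update by the degree of staleness.

```python
def async_sgd_apply(global_params, grad, lr, staleness):
    """Apply a possibly stale gradient to the current global parameters (sketch).

    `staleness` is the number of global updates performed since the worker read
    the parameters on which `grad` was computed. Dividing the learning rate by
    (1 + staleness) is one common damping heuristic for stale gradients.
    """
    effective_lr = lr / (1.0 + staleness)
    return global_params - effective_lr * grad
```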

Preliminaries and definitions

In this section, we begin by introducing the conventional synchronous FL. We then give an extended form based on this synchronous form, which incorporates both synchronous and asynchronous updating schemes. The main notations used in this paper are listed in Table 2.

Suppose there are M local devices and a server in an FL system. We use $X^{(i)} = \{x_{i,1}, x_{i,2}, \ldots, x_{i,m_i}\}$ to denote the local data in device $i$, where $x_{i,j}$ ($j = 1, 2, \ldots, m_i$) is the $j$th sample in $X^{(i)}$, and $m_i$ is the size of the data in device $i$. We denote $X = \bigcup_{i}$…
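The local datasets $X^{(i)}$ above can be mirrored in a small simulation. The sketch below is ours (the function name and the label-skew shard scheme are assumptions, not the paper's setup); it builds a non-IID partition of a labeled dataset across M devices by sorting samples by label and handing each device a few contiguous shards.

```python
import numpy as np

def label_skew_partition(labels, num_devices, shards_per_device=2, seed=0):
    """Split sample indices across devices in a non-IID (label-skewed) way (sketch)."""
    rng = np.random.default_rng(seed)
    order = np.argsort(labels)  # indices sorted by label, so shards are label-homogeneous
    shards = np.array_split(order, num_devices * shards_per_device)
    shard_ids = rng.permutation(len(shards))
    partition = {}
    for i in range(num_devices):
        picked = shard_ids[i * shards_per_device:(i + 1) * shards_per_device]
        partition[i] = np.concatenate([shards[s] for s in picked])
    return partition  # device i -> indices of its local data X^(i)
```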

Theoretical analysis

We theoretically analyze the extended form of FL introduced in Section 3. The overall ideas of the analysis are as follows. First, by applying error decomposition techniques, we show that there exist two training stages (Theorem 1). Second, we analyze convergence bounds without/with decayed learning rates (Theorems 2 and 3), and the optimal choices of the number of local epochs (Theorem 4) and τ (Theorem 5) in these stages, using the extended form.
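While the precise statements of Theorems 1–5 appear in the full text, the two-stage intuition can be written schematically. The bound below is our illustrative placeholder: the constants $C_1$, $C_2$ and the exact dependence on the local epochs $I$, the staleness $\tau$, and a non-IIDness measure $\Gamma$ are assumptions, not the paper's results.

```latex
% Schematic error decomposition (illustrative placeholder, not the paper's theorem).
% After T global updates with learning rate \eta, the suboptimality splits into a
% term that decays with T and a floor driven by I, \tau, and \Gamma.
\mathbb{E}\!\left[F(w_T)\right] - F^{*}
  \;\lesssim\;
  \underbrace{\frac{C_1}{\eta T}}_{\text{vanishing optimization error}}
  \;+\;
  \underbrace{\eta\, C_2(I,\tau,\Gamma)}_{\text{error floor}}
```

Under such a split, a first stage in which the vanishing term dominates favors a constant learning rate and more local computation, whereas a second stage in which the floor dominates calls for decaying the learning rate and adjusting I and τ, mirroring the without/with-decay cases of Theorems 2 and 3.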

FedSA

Building on the key results and remarks of the preceding analysis, we propose FedSA (Federated Staleness-Aware), an efficient and robust AFL algorithm designed to combat heterogeneity and staleness. Algorithms 1 and 2 describe the server-side and device-side processes, respectively. Next, we detail the strategies corresponding to the above theorems and remarks.
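To illustrate what a staleness-aware asynchronous server loop can look like, here is a minimal sketch under our own assumptions; it is not the paper's Algorithm 1, and the discount rule is only one plausible choice. The server mixes each arriving local model into the global model with a coefficient that shrinks as the update becomes more stale.

```python
import numpy as np

def staleness_weight(staleness, alpha=0.6):
    """Discount the mixing coefficient for stale updates (illustrative rule)."""
    return alpha / (1.0 + staleness)

def async_server_step(global_model, global_version, local_model, local_version):
    """Merge one device's upload into the global model without waiting for others.

    `local_version` is the global version the device started training from, so
    staleness = global_version - local_version.
    """
    staleness = global_version - local_version
    w = staleness_weight(staleness)
    new_global = (1.0 - w) * global_model + w * local_model
    return new_global, global_version + 1
```

After each such step the server would send the updated model back to the uploading device, so fast devices keep contributing while stale devices are absorbed with smaller weights.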

Experiments

Conclusions and future work

This paper proposes FedSA, a novel AFL algorithm that accelerates convergence and resists the performance degradation caused by non-IID data and staleness. Experimental results show that our approach converges faster than other methods and can resist the staleness effect without sacrificing much accuracy or communication. Our empirical findings also suggest that our algorithm has the potential to resist malicious attacks such as Byzantine attacks. Finally, our methodologies are…

CRediT authorship contribution statement

Ming Chen: Conceptualization, Project administration, Supervision, Writing - original draft, Resources. Bingcheng Mao: Methodology, Writing - original draft, Writing - review & editing, Software, Formal analysis, Visualization. Tianyi Ma: Methodology, Investigation, Project administration, Supervision, Writing - original draft, Writing - review & editing, Software, Validation, Resources.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.


References (29)

  • McMahan, B., et al., Communication-efficient learning of deep networks from decentralized data

  • Sahu, A.K., et al., On the convergence of federated optimization in heterogeneous networks (2018)

  • Chen, T., et al., LAG: Lazily aggregated gradient for communication-efficient distributed learning

  • Xie, C., Koyejo, S., Gupta, I., Zeno: Distributed stochastic gradient descent with suspicion-based fault-tolerance, in: ...

  • Samarakoon, S., et al., Distributed federated learning for ultra-reliable low-latency vehicular communications (2018)

  • Xie, C., et al., Asynchronous federated optimization (2019)

  • Hsieh, K., Harlap, A., Vijaykumar, N., Konomis, D., Ganger, G.R., Gibbons, P.B., Mutlu, O., Gaia: Geo-distributed machine ...

  • Hakimi, I., et al., Taming momentum in a distributed asynchronous environment (2019)

  • Zhang, W., et al., Staleness-aware async-SGD for distributed deep learning

  • Mitliagkas, I., et al., Asynchrony begets momentum, with an application to deep learning

  • Hadjis, S., et al., Omnivore: An optimizer for multi-device deep learning on CPUs and GPUs (2016)

  • Lian, X., et al., Asynchronous parallel stochastic gradient for nonconvex optimization

  • Chen, J., et al., Revisiting distributed synchronous SGD (2016)

  • Cui, H., et al., GeePS: Scalable deep learning on distributed GPUs with a GPU-specialized parameter server


Ming Chen received the Ph.D. degree in Computer Science and Technology from Zhejiang University. He is with the College of Computer Science and Technology, Zhejiang University, and serves as the Director of the AI Research Institute, Hithink RoyalFlush Information Network Co., Ltd. His current research interests include machine learning, big data, and computer vision.

Bingcheng Mao received his B.S. and M.S. degrees from the College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, China, in 2016 and 2019, respectively. He is currently working with Hithink RoyalFlush Information Network Co., Ltd. His current research interests include federated learning, machine learning, and computer vision.

Tianyi Ma received the Ph.D. degree in Computer Science and Technology from the Department of Electrical, Computer and Biomedical Engineering, University of Pavia, Italy, in 2017. He is a postdoctoral researcher at the College of Computer Science and Technology, Zhejiang University, and works with Hithink RoyalFlush Information Network Co., Ltd. His current research interests include federated learning and recommender systems.

1 Equal contribution.
