FedSA: A staleness-aware asynchronous Federated Learning algorithm with non-IID data
Introduction
The recently emerged Federated Learning (FL) is a privacy-preserving paradigm for next-generation Artificial Intelligence (AI) that trains a global model across large-scale devices without disclosing local data. The process can be summarized as follows: devices first download the global model (or its initialization) from a server as their local models and train these models on their local data. The newly trained models are then uploaded to the server. Finally, the server applies weighted averaging (e.g., FedAvg [1]) to aggregate the uploaded models into a new global model and broadcasts it to the devices.
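The weighted-averaging step can be sketched as follows. This is a minimal illustration of FedAvg-style aggregation, not the paper's implementation; the function name and the toy values are ours.

```python
import numpy as np

def fedavg_aggregate(local_models, data_sizes):
    """Weighted average of local model parameters (FedAvg-style).

    local_models: list of 1-D parameter vectors, one per device.
    data_sizes:   number of local samples on each device; aggregation
                  weights are proportional to these sizes.
    """
    total = sum(data_sizes)
    weights = [n / total for n in data_sizes]
    return sum(w * m for w, m in zip(weights, local_models))

# Three devices with different amounts of local data.
models = [np.array([1.0, 1.0]), np.array([2.0, 2.0]), np.array([4.0, 4.0])]
sizes = [10, 10, 20]
global_model = fedavg_aggregate(models, sizes)  # -> array([2.75, 2.75])
```

The device holding half of the data contributes half of the average, which is exactly what makes skewed data distributions influential in non-IID settings.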
Unfortunately, substantial challenges lie ahead in implementing efficient and robust FL. First, statistical heterogeneity exists in non-IID (non-Independent and Identically Distributed) data, where data are highly skewed, extremely imbalanced, and vary across devices [1], [2]. This heterogeneity may cause distribution shifts, which necessitates evaluating it; yet, because of data-protection requirements, sharing data between devices is impractical in FL, which makes such heterogeneity difficult to perceive. Second, unreliable connections and limited computing resources in heterogeneous environments cause environmental heterogeneity, where some devices are inactive or slow. The existence of such devices has been shown to degrade the performance of the global model [2], [3]. Furthermore, staleness accumulates because slow devices may remain unresponsive during training, leaving their local models a lower chance of being aggregated. We call this the staleness effect; it magnifies the degree of statistical heterogeneity, since the discrepancies between stale models and the global model may keep increasing. Finally, synchronous methods are too expensive to use, especially at large scale, as the server has to wait until all or sampled devices are ready [4].
To date, several studies have investigated Asynchronous Federated Learning (AFL), which allows model aggregation without waiting for stale devices [5], [6]. However, introducing asynchrony aggravates the staleness effect. The distribution of real-world data across devices is usually long-tailed, where "the rich get richer": data may grow more rapidly on some devices, demanding more processing and training time while computing capabilities remain unchanged. Eventually, this accumulated delay affects the convergence of the global model. Furthermore, communication becomes a bottleneck, since asynchrony requires more frequent exchanges between devices and the server [1], [7]. AFL also brings new challenges in the dynamics of hyperparameters and in communication-accuracy trade-offs. Finally, theoretical convergence guarantees for AFL with non-IID data, and practical guidelines for designing efficient algorithms in this setting, remain open.
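In AFL, each arriving local model is typically mixed into the global model immediately, with stale contributions discounted. A minimal sketch, using a polynomial staleness discount in the style of prior asynchronous federated optimization work; the function name and constants are illustrative, not the paper's algorithm:

```python
def async_update(global_model, local_model, staleness, alpha=0.6, a=0.5):
    """Mix one arriving local model into the global model.

    staleness: number of global rounds since the device downloaded the
               model it trained on; staler updates receive a smaller
               mixing weight via a polynomial discount.
    """
    alpha_t = alpha * (staleness + 1) ** (-a)   # discount by staleness
    return (1 - alpha_t) * global_model + alpha_t * local_model

# A fresh update moves the global model more than a stale one
# (scalar "models" used here for readability).
fresh = async_update(0.0, 1.0, staleness=0)   # mixing weight 0.6
stale = async_update(0.0, 1.0, staleness=8)   # mixing weight 0.2
```

Discounting limits the damage a long-delayed model can do, but it also reduces the influence of slow devices' data, which is one way the staleness effect magnifies statistical heterogeneity.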
Contributions. This paper presents a theoretical analysis and a novel AFL algorithm to alleviate the issues mentioned above. Specifically, our main contributions are summarized as follows:
- •
We extend the form of FL by introducing an asynchronous scheme with an asynchrony-related parameter, and we provide a theoretical analysis of this new form. Concretely, we suggest new assumptions for non-IIDness, which enable us to measure and bound the impact of non-IIDness using model discrepancies, and we derive practical strategies for handling non-IIDness and staleness. The key results include: 1) a two-stage training strategy that accelerates training; 2) optimal strategies for choosing key hyperparameters. To the best of our knowledge, this is the first analysis of AFL under this two-fold heterogeneity with non-IID data, and the first FL approach that identifies and applies such a two-stage training strategy. Our analysis provides not only performance guarantees but also strategies for designing efficient and robust AFL algorithms.
- •
Following the guidelines and implications from the above analysis, we propose FedSA, a novel asynchronous FL algorithm. It uses the two-stage strategy and dynamically chooses proper hyperparameters by simultaneously considering the two-fold heterogeneity in devices and the similarities among local models.
- •
We conduct extensive experiments to validate the efficiency and robustness of our algorithm. Results on both non-IID and IID data show a state-of-the-art convergence speed and superior robustness to stale devices.
We briefly summarize the subsequent sections. Section 2 reviews literature and approaches related to this research. Section 3 introduces our methodology, detailing the general form of FL methods in both synchronous and asynchronous terms. In Section 4, we theoretically analyze performance guarantees. Building on those guarantees, FedSA and its strategies are presented in Section 5. Section 6 reports our extensive experiments. Section 7 concludes with a summary of our findings and potential future work.
Related work
Our work is related to AFL algorithms and to staleness-resilience approaches in the closely related asynchronous stochastic gradient descent (Async-SGD) literature. Async-SGD can accelerate training with near-linear speedup but is known to have the staleness effect as its primary problem [8]. Specifically, such methods suffer from 1) staleness introduced by delayed gradients, and 2) higher complexity in model dynamics, such as hyperparameter tuning and speed-accuracy trade-offs [9]. Staleness effect on the convergence
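A common staleness-resilience technique in this literature, scaling the learning rate down with a gradient's delay in the style of staleness-aware async-SGD, can be sketched as follows (function name and constants are illustrative):

```python
import numpy as np

def apply_delayed_gradient(params, grad, staleness, base_lr=0.1):
    """One SGD step with a delayed gradient.

    The effective learning rate is divided by the gradient's staleness,
    so gradients computed on older parameters move the model less.
    """
    lr = base_lr / max(1, staleness)
    return params - lr * grad

params = np.array([1.0, 1.0])
g = np.array([1.0, 1.0])
fresh = apply_delayed_gradient(params, g, staleness=1)  # step of 0.1
stale = apply_delayed_gradient(params, g, staleness=5)  # step of 0.02
```

This damps the oscillation caused by stale gradients at the cost of slower progress from slow workers, the same trade-off FedSA must manage at the model level.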
Preliminaries and definitions
In this section, we begin by introducing conventional synchronous FL. We then give an extended form, based on this synchronous form, that incorporates both synchronous and asynchronous updating schemes. The main notation used in this paper is listed in Table 2.
Suppose there are N local devices and a server in an FL system. We use D_k to denote the local data on device k, where x_{k,i} (i = 1, 2, ..., n_k) is the i-th sample in D_k, and n_k is the size of the data on device k. We denote
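With this per-device setup, the standard synchronous FL objective is the sample-size-weighted average of the local empirical risks. We state it here for concreteness, writing N for the number of devices, n_k for the data size of device k, F_k for its local objective, and l for the loss; the symbol choices are ours:

```latex
\min_{x}\; F(x) \;=\; \sum_{k=1}^{N} \frac{n_k}{n}\, F_k(x),
\qquad
F_k(x) \;=\; \frac{1}{n_k} \sum_{i=1}^{n_k} \ell\big(x;\, x_{k,i}\big),
\qquad
n \;=\; \sum_{k=1}^{N} n_k .
```

The weights n_k / n mirror the weighted averaging performed at aggregation time, tying the server-side update to this global objective.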
Theoretical analysis
We theoretically analyze the extended form of FL introduced in Section 3. The overall ideas of the theoretical analysis are as follows. First, by applying error-decomposition techniques, we show that there exist two training stages (Theorem 1). Second, we analyze convergence bounds without/with decayed learning rates (Theorem 2, Theorem 3), and the optimal choices of the number of local epochs (Theorem 4) and of a further key hyperparameter (Theorem 5) in the above stages, using the extended form.
FedSA
Building on the key results and remarks of the preceding analysis, we propose FedSA (Federated Staleness-Aware), an efficient and robust AFL algorithm that combats heterogeneity and staleness. Algorithms 1 and 2 describe the processes on the server and on the devices, respectively. Next, we detail the strategies corresponding to the above theorems and remarks.
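The two-stage training strategy suggested by the analysis (a constant learning rate early for fast progress, then a decayed rate to control the error floor) can be sketched as follows. This is our simplified illustration, not the paper's Algorithm 1; the switch point and constants are placeholders.

```python
def two_stage_lr(round_t, switch_round=100, eta0=0.1, decay=0.01):
    """Two-stage learning-rate schedule.

    Stage one (round_t < switch_round): constant rate for fast initial
    convergence. Stage two: inverse decay to shrink the error caused
    by non-IIDness and staleness.
    """
    if round_t < switch_round:
        return eta0                                        # stage one
    return eta0 / (1 + decay * (round_t - switch_round))   # stage two

# The rate stays flat, then decays once stage two begins.
lrs = [two_stage_lr(t) for t in (0, 99, 100, 200)]  # 0.1, 0.1, 0.1, 0.05
```

In practice the switch point would be chosen from observed training progress rather than fixed in advance; here it is hard-coded only to keep the sketch self-contained.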
Experiments
Conclusions and future work
This paper proposes FedSA, a novel AFL algorithm that accelerates convergence and resists the performance degradation caused by non-IID data and staleness. Experimental results show that our approach converges faster than competing methods and that it resists the staleness effect without sacrificing much accuracy or communication. Our empirical findings also suggest that our algorithm has the potential to resist malicious attacks such as Byzantine attacks. Finally, our methodologies are
CRediT authorship contribution statement
Ming Chen: Conceptualization, Project administration, Supervision, Writing - original draft, Resources. Bingcheng Mao: Methodology, Writing - original draft, Writing - review & editing, Software, Formal analysis, Visualization. Tianyi Ma: Methodology, Investigation, Project administration, Supervision, Writing - original draft, Writing - review & editing, Software, Validation, Resources.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References (29)
- et al., Communication-efficient learning of deep networks from decentralized data
- et al., On the convergence of federated optimization in heterogeneous networks (2018)
- et al., LAG: Lazily aggregated gradient for communication-efficient distributed learning
- C. Xie, S. Koyejo, I. Gupta, Zeno: Distributed stochastic gradient descent with suspicion-based fault-tolerance, in:...
- et al., Distributed federated learning for ultra-reliable low-latency vehicular communications (2018)
- et al., Asynchronous federated optimization (2019)
- K. Hsieh, A. Harlap, N. Vijaykumar, D. Konomis, G.R. Ganger, P.B. Gibbons, O. Mutlu, Gaia: Geo-distributed machine...
- et al., Taming momentum in a distributed asynchronous environment (2019)
- et al., Staleness-aware async-SGD for distributed deep learning
- et al., Asynchrony begets momentum, with an application to deep learning
- Omnivore: An optimizer for multi-device deep learning on CPUs and GPUs
- Asynchronous parallel stochastic gradient for nonconvex optimization
- Revisiting distributed synchronous SGD
- GeePS: Scalable deep learning on distributed GPUs with a GPU-specialized parameter server
Ming Chen received the Ph.D. degree in Computer Science and Technology from Zhejiang University. He is with the College of Computer Science and Technology, Zhejiang University, and serves as the Director of AI Research Institute, Hithink RoyalFlush Information Network Co., Ltd. His current research interests include machine learning, big data and computer vision.
Bingcheng Mao received his B.S. and M.S. degrees in College of Computer Science and Technology from Nanjing University of Aeronautics and Astronautics, China, in 2016 and 2019, respectively. He is currently working with Hithink RoyalFlush Information Network Co., Ltd. His current research interests include federated learning, machine learning and computer vision.
Tianyi Ma received the Ph.D. degree in Computer Science and Technology with the Department of Electrical, Computer and Biomedical Engineering from University of Pavia, Italy, 2017. He is a postdoctoral researcher at the College of Computer Science and Technology, Zhejiang University, and work with Hithink RoyalFlush Information Network Co., Ltd. His current research interests include federated learning and recommender systems.
1 Equal contribution.