
Data-free adaptive structured pruning for federated learning

The Journal of Supercomputing

Abstract

Federated learning faces challenges in real-world deployments due to limited client resources and the straggler problem caused by high system heterogeneity. Although model pruning can reduce the training and communication overhead of federated learning, a uniform pruning ratio fundamentally fails to address the efficiency impact of stragglers in heterogeneous systems. Adapting the pruned sub-models to individual device capabilities is therefore crucial, yet it remains under-researched. In this work, we propose AdaPruneFL, a data-free adaptive structured pruning algorithm that formulates adaptive pruning in federated learning as an optimization problem constrained to align the response latency of each client’s local training, thereby identifying a fine-grained, device-adaptive model compression ratio. Combined with sequential structured pruning, we extract heterogeneous but aggregable sub-model structures matched to client device capabilities, accelerating training in a hardware-friendly manner while mitigating the straggler effect. Extensive experiments demonstrate that, compared to FedAvg, AdaPruneFL achieves 1.38–3.88x faster training on general-purpose hardware platforms while maintaining comparable convergence accuracy.

Data availability

The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.

Notes

  1. https://www.kaggle.com/c/tiny-imagenet.

Acknowledgements

This work was supported by the Strategic Priority Research Program of the Chinese Academy of Sciences under Grant No. XDA19020102.

Author information

Corresponding author

Correspondence to Jing Li.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this article.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

Experiment setup details

Data partition We use a Dirichlet(\(\gamma \)) distribution to generate the non-IID data partition among clients, where \(\gamma \) controls the skewness of the data distribution. In our experiments, we partition the CIFAR-10/100 datasets using \(\gamma =0.5\). Because the TinyImagenet dataset has a larger number of classes, we use a smaller value of \(\gamma =0.2\) to generate its heterogeneous data distribution. The resulting data partitions for the different training scenarios are illustrated in Fig. 13:

Fig. 13: Data partitioning results for different datasets
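
For reference, the snippet below is a minimal sketch of one way to generate such a Dirichlet(\(\gamma \)) non-IID partition. It assumes NumPy only; the function name dirichlet_partition and the placeholder label array are illustrative and are not taken from the paper's released code.

```python
import numpy as np

def dirichlet_partition(labels, num_clients, gamma, seed=0):
    """Split sample indices among clients with a per-class Dirichlet(gamma) prior.

    Smaller gamma -> more skewed (more heterogeneous) client label distributions.
    """
    rng = np.random.default_rng(seed)
    num_classes = int(labels.max()) + 1
    client_indices = [[] for _ in range(num_clients)]

    for c in range(num_classes):
        idx = np.where(labels == c)[0]
        rng.shuffle(idx)
        # Proportion of class-c samples assigned to each client.
        proportions = rng.dirichlet(np.full(num_clients, gamma))
        # Convert cumulative proportions to split points over the shuffled indices.
        split_points = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for client_id, part in enumerate(np.split(idx, split_points)):
            client_indices[client_id].extend(part.tolist())

    return [np.array(ci) for ci in client_indices]

# Example: CIFAR-10-style labels, 10 clients, gamma = 0.5 (as used in the paper).
labels = np.random.randint(0, 10, size=50_000)  # placeholder for real CIFAR-10 labels
parts = dirichlet_partition(labels, num_clients=10, gamma=0.5)
print([len(p) for p in parts])
```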

Model architecture details The model architectures used in our experiments are detailed in Table 9.

Bandwidth allocation In our experiments, we simulate the heterogeneity of device communication capabilities by randomly allocating communication bandwidth within a 4G range. The detailed client bandwidth allocation is shown in Table 10. We allow random perturbations of up to 2 MB/s to simulate network fluctuations.

Table 9 Model architectures
Table 10 Bandwidth allocation of clients
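
As a small illustration of the bandwidth simulation described above, the sketch below samples a fixed per-client bandwidth from a 4G-like range and adds a bounded per-round perturbation. The range, client count, and function names are assumed placeholders, not the values of Table 10.

```python
import numpy as np

rng = np.random.default_rng(42)
NUM_CLIENTS = 10

# Nominal per-client bandwidth (MB/s), sampled once; the 4G-like range is illustrative.
nominal_bw = rng.uniform(5.0, 40.0, size=NUM_CLIENTS)

def sample_round_bandwidth():
    """Per-round bandwidth: nominal value plus a perturbation bounded by 2 MB/s."""
    jitter = rng.uniform(-2.0, 2.0, size=NUM_CLIENTS)
    return np.clip(nominal_bw + jitter, 0.5, None)  # keep bandwidth strictly positive

def upload_latency(submodel_size_mb, bw_mb_s):
    """Communication latency (seconds) for uploading a sub-model of the given size."""
    return submodel_size_mb / bw_mb_s

print(upload_latency(10.0, sample_round_bandwidth()))  # per-client upload time, 10 MB sub-model
```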

Solution of the optimization problem

Using a linear form to approximate the relationship between computational complexity and computational latency, and combining equations (4) and (5), we obtain the following optimization problem:

$$\begin{aligned} \min _{p^c_l}\ -\sum ^N_{l=1}\log {p^c_l} \quad \text {subject to}\quad&|\mathcal {D}^c|\cdot \left( k \sum ^N_{l=1}\beta _l\cdot p^c_l + b\right) +\frac{\sum ^N_{l=1}\alpha _l\cdot p^c_l}{B^c} \le t_{target}, \\&0<p^c_l\le 1,\ \forall \, 1\le l\le N \end{aligned}$$
(8)

According to the Lagrange multiplier method, the Lagrangian function for (8) is constructed as:

$$\begin{aligned} L(p^c_l, \mu ) = -\sum ^N_{l=1}\log {p^c_l} + \mu&\left( |\mathcal {D}^c|\cdot \left( k \sum ^N_{l=1}\beta _l\cdot p^c_l + b\right) +\frac{\sum ^N_{l=1}\alpha _l\cdot p^c_l}{B^c} - t_{target}\right) , \\&0<p^c_l\le 1,\ \forall \, 1\le l\le N \end{aligned}$$
(9)

where \(\mu \ge 0\) is the Lagrange multiplier. According to the Karush–Kuhn–Tucker (KKT) conditions, we have:

$$\begin{aligned}&\frac{\partial L}{\partial p^c_l} = -\frac{1}{p^c_l} + \mu \left( |\mathcal {D}^c|\, k\, \beta _l + \frac{\alpha _l}{B^c}\right) = 0\\&\frac{\partial L}{\partial \mu } = |\mathcal {D}^c|\left( k \sum ^N_{l=1}\beta _l\cdot p^c_l + b\right) + \frac{\sum ^N_{l=1}\alpha _l\cdot p^c_l}{B^c} - t_{target} = 0\\&\mu \left( |\mathcal {D}^c|\left( k \sum ^N_{l=1}\beta _l\cdot p^c_l + b\right) + \frac{\sum ^N_{l=1}\alpha _l\cdot p^c_l}{B^c} - t_{target}\right) = 0\\&0<p^c_l\le 1,\ \forall \, 1\le l\le N \end{aligned}$$
(10)

Solving the stationarity condition in (10) for \(p^c_l\), projecting onto the feasible interval \((0,1]\), and taking the latency constraint to be active (\(\mu >0\)), we conclude:

$$\begin{aligned} \left\{ \begin{aligned}&p^c_l=\min \left( \frac{B^c}{\mu \left( |\mathcal {D}^c|\cdot k \cdot \beta _l \cdot B^c+\alpha _l\right) },\,1\right) \\&|\mathcal {D}^c|\left( k \sum ^N_{l=1}\beta _l\, p^c_l+b\right) +\frac{\sum ^N_{l=1}\alpha _l\, p^c_l}{B^c}=t_{target}\\&\mu >0 \end{aligned} \right. \end{aligned}$$
(11)
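
In practice, (11) can be evaluated by a one-dimensional search over \(\mu \): each \(p^c_l\) follows the closed form in the first line, and \(\mu \) is chosen, for example by bisection, so that the latency constraint holds with equality (the left-hand side is non-increasing in \(\mu \)). The sketch below illustrates this; the function name adaptive_ratios and all numeric values are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def adaptive_ratios(alpha, beta, k, b, D, B, t_target,
                    mu_lo=1e-8, mu_hi=1e8, iters=100):
    """Per-layer keep ratios p_l from Eq. (11): closed form in mu, mu found by bisection.

    alpha[l]: communication cost of layer l, beta[l]: computation cost of layer l,
    (k, b): linear fit of computation latency vs. complexity, D: local dataset size,
    B: client bandwidth, t_target: target per-round latency.
    """
    alpha, beta = np.asarray(alpha, float), np.asarray(beta, float)

    def ratios(mu):
        return np.minimum(B / (mu * (D * k * beta * B + alpha)), 1.0)

    def latency(p):
        return D * (k * np.dot(beta, p) + b) + np.dot(alpha, p) / B

    if latency(ratios(mu_lo)) <= t_target:   # even the unpruned model meets the target
        return np.ones_like(alpha)
    if latency(ratios(mu_hi)) > t_target:    # target unreachable: return the smallest ratios
        return ratios(mu_hi)

    for _ in range(iters):                   # latency(ratios(mu)) is non-increasing in mu
        mu = np.sqrt(mu_lo * mu_hi)          # bisect in log space for numerical range
        if latency(ratios(mu)) > t_target:
            mu_lo = mu
        else:
            mu_hi = mu
    return ratios(mu_hi)

# Illustrative numbers only (not from the paper).
p = adaptive_ratios(alpha=[2.0, 4.0, 8.0], beta=[1.0, 2.0, 3.0],
                    k=1e-4, b=1e-3, D=500, B=10.0, t_target=1.5)
print(p)
```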

Cite this article

Fan, W., Yang, K., Wang, Y. et al. Data-free adaptive structured pruning for federated learning. J Supercomput 80, 18600–18626 (2024). https://doi.org/10.1007/s11227-024-06162-1
