Simultaneous Linear Connectivity of Neural Networks Modulo Permutation

  • Conference paper
  • In: Machine Learning and Knowledge Discovery in Databases. Research Track (ECML PKDD 2024)

Abstract

Neural networks typically exhibit permutation symmetry, as reordering neurons in each layer does not change the underlying function they compute. These symmetries contribute to the non-convexity of the networks’ loss landscapes, since linearly interpolating between two permuted versions of a trained network tends to encounter a high loss barrier. Recent work has argued that permutation symmetries are the only sources of non-convexity, meaning there are essentially no such barriers between trained networks if they are permuted appropriately. In this work, we refine these arguments into three distinct claims of increasing strength. We show that existing evidence only supports “weak linear connectivity”—that for each pair of networks belonging to a set of SGD solutions, there exist (multiple) permutations that linearly connect it with the other networks. In contrast, the claim “strong linear connectivity”—that for each network, there exists one permutation that simultaneously connects it with the other networks—is both intuitively and practically more desirable. This stronger claim would imply that the loss landscape is convex after accounting for permutation, and enable linear interpolation between three or more independently trained models without increased loss. In this work, we introduce an intermediate claim—that for certain sequences of networks, there exists one permutation that simultaneously aligns matching pairs of networks from these sequences. Specifically, we discover that a single permutation aligns sequences of iteratively trained as well as iteratively pruned networks, meaning that two networks exhibit low loss barriers at each step of their optimization and sparsification trajectories respectively. Finally, we provide the first evidence that strong linear connectivity may be possible under certain conditions, by showing that barriers decrease with increasing network width when interpolating among three networks.
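
The abstract rests on two mechanics: reordering the hidden units of a layer leaves the network's function unchanged, and the "loss barrier" measures the worst-case excess loss along the linear path between two parameter vectors. The sketch below is a toy NumPy illustration of both, not the authors' code; the two-layer MLP, the synthetic data, and all variable names are assumptions made only for this example.

```python
# Minimal sketch (illustrative assumptions throughout, not the paper's implementation):
# (i) permuting hidden units preserves the function of a two-layer ReLU MLP;
# (ii) the loss barrier is the max excess loss along a linear interpolation path.
import numpy as np

rng = np.random.default_rng(0)

def mlp_forward(x, W1, b1, W2, b2):
    """Two-layer ReLU MLP: x -> relu(x W1 + b1) W2 + b2."""
    h = np.maximum(x @ W1 + b1, 0.0)
    return h @ W2 + b2

def permute_hidden(W1, b1, W2, pi):
    """Reorder hidden units by permutation pi; the computed function is unchanged."""
    return W1[:, pi], b1[pi], W2[pi, :]

def mse(params, x, y):
    W1, b1, W2, b2 = params
    pred = mlp_forward(x, W1, b1, W2, b2)
    return np.mean((pred - y) ** 2)

def loss_barrier(params_a, params_b, x, y, n_points=25):
    """max_alpha L(alpha*a + (1-alpha)*b) - [alpha*L(a) + (1-alpha)*L(b)]."""
    la, lb = mse(params_a, x, y), mse(params_b, x, y)
    barrier = -np.inf
    for alpha in np.linspace(0.0, 1.0, n_points):
        mixed = [alpha * pa + (1 - alpha) * pb for pa, pb in zip(params_a, params_b)]
        barrier = max(barrier, mse(mixed, x, y) - (alpha * la + (1 - alpha) * lb))
    return barrier

# Random network and synthetic data, just to exercise the functions.
d_in, d_hidden, d_out, n = 5, 16, 1, 128
x, y = rng.normal(size=(n, d_in)), rng.normal(size=(n, d_out))
W1, b1, W2, b2 = (rng.normal(size=s) for s in
                  [(d_in, d_hidden), (d_hidden,), (d_hidden, d_out), (d_out,)])

pi = rng.permutation(d_hidden)
W1p, b1p, W2p = permute_hidden(W1, b1, W2, pi)

# (i) Permutation symmetry: outputs agree up to floating-point error.
assert np.allclose(mlp_forward(x, W1, b1, W2, b2), mlp_forward(x, W1p, b1p, W2p, b2))

# (ii) Barrier along the linear path between the network and its permuted copy.
# For trained networks the paper's premise is that this tends to be high unless an
# aligning permutation is applied first; with random weights the printed value only
# demonstrates the computation.
print("barrier:", loss_barrier((W1, b1, W2, b2), (W1p, b1p, W2p, b2), x, y))
```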

E. Sharma and D. Kwok—Equal contribution.

Notes

  1. See Appendix A for details on handling other types of layers.

  2. Weight matching is approximate, as finding the actual optimum is NP-hard [1, 3]; see the single-layer sketch after these notes.

  3. These conditions include sufficient width and the use of layer normalization.

  4. In fact, a strictly stronger claim is made: for a certain class of networks \(\mathcal{F}\), for all \(\theta_1 \in \mathcal{F}\), there is a single permutation that can be applied to \(\theta_1\), removing the error barrier between the permuted \(\theta_1\) and any other network in the class \(\mathcal{F}\). Note that this also means that the networks in \(\mathcal{F}\) are piecewise linearly connected before permuting.
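
As a purely illustrative companion to note 2: the sketch below matches the hidden units of one layer of a network B to those of a network A by solving a linear assignment problem with SciPy. This single-layer assignment is the building block that weight-matching methods such as Git Re-Basin [1] apply layer by layer, and only approximately, since the joint multi-layer problem is NP-hard [1, 3]. The variable names and the planted-permutation check are assumptions for this example.

```python
# Illustrative single-layer weight matching (an assumption, not the paper's code):
# align B's hidden units to A's by maximizing total inner-product similarity.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_hidden_units(W1_a, W1_b):
    """Return pi such that W1_b[:, pi] best aligns with W1_a (columns are hidden units)."""
    similarity = W1_a.T @ W1_b                              # similarity[i, j] = <unit i of A, unit j of B>
    row_ind, col_ind = linear_sum_assignment(-similarity)   # negate: maximize total similarity
    return col_ind                                          # row_ind is 0..n-1 for a square cost matrix

rng = np.random.default_rng(0)
W1_a = rng.normal(size=(5, 16))                             # one layer of network A
planted = rng.permutation(16)
W1_b = W1_a[:, planted] + 0.01 * rng.normal(size=(5, 16))   # B: permuted, slightly noisy copy of A

pi = match_hidden_units(W1_a, W1_b)
print(np.allclose(W1_b[:, pi], W1_a, atol=0.05))            # True: the planted permutation is undone
```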

References

  1. Ainsworth, S., Hayase, J., Srinivasa, S.: Git re-basin: merging models modulo permutation symmetries. In: The Eleventh International Conference on Learning Representations (2023)

  2. Akash, A.K., Li, S., Trillos, N.G.: Wasserstein barycenter-based model fusion and linear mode connectivity of neural networks (2022)

  3. Altschuler, J.M., Boix-Adserà, E.: Wasserstein barycenters are NP-hard to compute. SIAM J. Math. Data Sci. 4(1), 179–203 (2022)

  4. Ba, J.L., Kiros, J.R., Hinton, G.E.: Layer normalization (2016)

  5. Benzing, F., et al.: Random initialisations performing above chance and how to find them. In: OPT 2022: Optimization for Machine Learning (NeurIPS 2022 Workshop) (2022)

  6. Entezari, R., Sedghi, H., Saukh, O., Neyshabur, B.: The role of permutation invariance in linear mode connectivity of neural networks. In: International Conference on Learning Representations (2022)

  7. Frankle, J., Carbin, M.: The lottery ticket hypothesis: finding sparse, trainable neural networks. In: International Conference on Learning Representations (2019)

  8. Frankle, J., Dziugaite, G.K., Roy, D., Carbin, M.: Linear mode connectivity and the lottery ticket hypothesis. In: Proceedings of the 37th International Conference on Machine Learning, vol. 119, pp. 3259–3269. PMLR (2020)

  9. Goodfellow, I.J., Vinyals, O., Saxe, A.M.: Qualitatively characterizing neural network optimization problems (2015)

  10. Jordan, K., Sedghi, H., Saukh, O., Entezari, R., Neyshabur, B.: REPAIR: REnormalizing permuted activations for interpolation repair. In: The Eleventh International Conference on Learning Representations (2023)

  11. Li, Y., Yosinski, J., Clune, J., Lipson, H., Hopcroft, J.: Convergent learning: do different neural networks learn the same representations? In: Proceedings of the 1st International Workshop on Feature Extraction: Modern Questions and Challenges at NIPS 2015, vol. 44, pp. 196–212. PMLR (2015)

  12. Lucas, J.R., Bae, J., Zhang, M.R., Fort, S., Zemel, R., Grosse, R.B.: On monotonic linear interpolation of neural network parameters. In: Proceedings of the 38th International Conference on Machine Learning, vol. 139, pp. 7168–7179. PMLR (2021)

  13. Nagarajan, V., Kolter, J.Z.: Uniform convergence may be unable to explain generalization in deep learning. In: Advances in Neural Information Processing Systems, vol. 32, pp. 11615–11626 (2019)

  14. O’Neill, J., Ver Steeg, G., Galstyan, A.: Layer-wise neural network compression via layer fusion. In: Proceedings of the 13th Asian Conference on Machine Learning, vol. 157, pp. 1381–1396. PMLR (2021)

  15. Paul, M., Chen, F., Larsen, B.W., Frankle, J., Ganguli, S., Dziugaite, G.K.: Unmasking the lottery ticket hypothesis: what’s encoded in a winning ticket’s mask? In: The Eleventh International Conference on Learning Representations (2023)

  16. Paul, M., Larsen, B., Ganguli, S., Frankle, J., Dziugaite, G.K.: Lottery tickets on a data diet: finding initializations with sparse trainable networks. In: Advances in Neural Information Processing Systems, vol. 35, pp. 18916–18928 (2022)

  17. Peña, F.A.G., Medeiros, H.R., Dubail, T., Aminbeidokhti, M., Granger, E., Pedersoli, M.: Re-basin via implicit Sinkhorn differentiation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 20237–20246 (2023)

  18. Raghu, M., Gilmer, J., Yosinski, J., Sohl-Dickstein, J.: SVCCA: singular vector canonical correlation analysis for deep learning dynamics and interpretability. In: Advances in Neural Information Processing Systems, vol. 30, pp. 6076–6085 (2017)

  19. Simsek, B., Ged, F., Jacot, A., Spadaro, F., Hongler, C., Gerstner, W., Brea, J.: Geometry of the loss landscape in overparameterized neural networks: symmetries and invariances. In: Proceedings of the 38th International Conference on Machine Learning, vol. 139, pp. 9722–9732. PMLR (2021)

  20. Singh, S.P., Jaggi, M.: Model fusion via optimal transport. In: Advances in Neural Information Processing Systems, vol. 33, pp. 22045–22055 (2020)

  21. Tatro, N., Chen, P.Y., Das, P., Melnyk, I., Sattigeri, P., Lai, R.: Optimizing mode connectivity via neuron alignment. In: Advances in Neural Information Processing Systems, vol. 33, pp. 15300–15311 (2020)

  22. Vlaar, T.J., Frankle, J.: What can linear interpolation of neural network loss landscapes tell us? In: Proceedings of the 39th International Conference on Machine Learning, vol. 162, pp. 22325–22341. PMLR (2022)

  23. Wang, H., Yurochkin, M., Sun, Y., Papailiopoulos, D., Khazaeni, Y.: Federated learning with matched averaging. In: International Conference on Learning Representations (2020)

  24. Wortsman, M., Horton, M.C., Guestrin, C., Farhadi, A., Rastegari, M.: Learning neural network subspaces. In: Proceedings of the 38th International Conference on Machine Learning, vol. 139, pp. 11217–11227. PMLR (2021)

  25. Yurochkin, M., Agarwal, M., Ghosh, S., Greenewald, K., Hoang, N., Khazaeni, Y.: Bayesian nonparametric federated learning of neural networks. In: Proceedings of the 36th International Conference on Machine Learning, vol. 97, pp. 7252–7261. PMLR (2019)

Acknowledgements

The authors would like to thank Tiffany Vlaar and Utku Evci for feedback on a draft and various ideas, as well as Udbhav Bamba for preliminary implementation work. DMR and DR are supported by Canada CIFAR AI Chairs and NSERC Discovery Grants. The authors also acknowledge material support from NVIDIA in the form of computational resources, and are grateful for technical support from the Mila IDT and Vector teams in maintaining the Mila and Vector Compute Clusters. Resources used to prepare this research were provided, in part, by Mila (mila.quebec), the Vector Institute (vectorinstitute.ai), the Province of Ontario, the Government of Canada through CIFAR, and companies sponsoring the Vector Institute.

Author information

Corresponding authors

Correspondence to Ekansh Sharma, Devin Kwok or Gintare Karolina Dziugaite.

Ethics declarations

Disclosure of Interests

The authors have no competing interests to declare that are relevant to the content of this article.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 2161 KB)

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Sharma, E., Kwok, D., Denton, T., Roy, D.M., Rolnick, D., Dziugaite, G.K. (2024). Simultaneous Linear Connectivity of Neural Networks Modulo Permutation. In: Bifet, A., Davis, J., Krilavičius, T., Kull, M., Ntoutsi, E., Žliobaitė, I. (eds) Machine Learning and Knowledge Discovery in Databases. Research Track. ECML PKDD 2024. Lecture Notes in Computer Science, vol 14947. Springer, Cham. https://doi.org/10.1007/978-3-031-70368-3_16

  • DOI: https://doi.org/10.1007/978-3-031-70368-3_16

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-70367-6

  • Online ISBN: 978-3-031-70368-3

  • eBook Packages: Computer Science, Computer Science (R0)
