Abstract
Neural networks typically exhibit permutation symmetry: reordering the neurons within each layer does not change the function the network computes. These symmetries contribute to the non-convexity of the networks' loss landscapes, since linearly interpolating between two permuted versions of a trained network tends to encounter a high loss barrier. Recent work has argued that permutation symmetries are the only sources of non-convexity, meaning there are essentially no such barriers between trained networks if they are permuted appropriately. In this work, we refine these arguments into three distinct claims of increasing strength. We show that existing evidence only supports "weak linear connectivity": for each pair of networks drawn from a set of SGD solutions, there exists a permutation that linearly connects them, so a single network may need a different permutation for each of the other networks. In contrast, "strong linear connectivity", the claim that for each network there exists one permutation that simultaneously connects it with all the other networks, is both intuitively and practically more desirable. This stronger claim would imply that the loss landscape is convex after accounting for permutation, and would enable linear interpolation among three or more independently trained models without increased loss. We also introduce an intermediate claim: for certain sequences of networks, there exists one permutation that simultaneously aligns matching pairs of networks from these sequences. Specifically, we find that a single permutation aligns sequences of iteratively trained as well as iteratively pruned networks, meaning that the two networks exhibit low loss barriers at each step of their optimization and sparsification trajectories, respectively. Finally, we provide the first evidence that strong linear connectivity may be possible under certain conditions, by showing that barriers decrease with increasing network width when interpolating among three networks.
E. Sharma and D. Kwok—Equal contribution.
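To make the quantities in the abstract concrete, the following is a minimal, self-contained sketch (not the authors' code) of the two ingredients involved: applying a hidden-unit permutation to a toy two-layer MLP, which leaves its function unchanged, and measuring the loss barrier along the linear interpolation path between two parameter vectors. The toy data, architecture, and random (untrained) weights below are hypothetical stand-ins; in the paper's setting the permutation would be chosen to align two trained networks (for example by matching weights or activations), not drawn at random.

```python
# Minimal sketch (not the authors' code): a toy two-layer MLP in NumPy.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 10))        # toy inputs
y = rng.integers(0, 2, size=256)      # toy binary labels

def init_params(width=32):
    return {
        "W1": rng.normal(scale=0.1, size=(width, 10)),
        "b1": np.zeros(width),
        "W2": rng.normal(scale=0.1, size=(2, width)),
        "b2": np.zeros(2),
    }

def loss(p):
    # Forward pass + cross-entropy on the toy data.
    h = np.maximum(X @ p["W1"].T + p["b1"], 0.0)   # ReLU hidden layer
    logits = h @ p["W2"].T + p["b2"]
    logits = logits - logits.max(axis=1, keepdims=True)
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(y)), y].mean()

def permute_hidden(p, perm):
    # Reordering hidden units leaves the function unchanged:
    # permute the rows of W1/b1 and the matching columns of W2.
    return {"W1": p["W1"][perm], "b1": p["b1"][perm],
            "W2": p["W2"][:, perm], "b2": p["b2"]}

def barrier(p_a, p_b, steps=25):
    # Loss barrier along the linear path: max over alpha of the loss of the
    # interpolated weights, minus the linearly interpolated endpoint losses.
    alphas = np.linspace(0.0, 1.0, steps)
    la, lb = loss(p_a), loss(p_b)
    mix = lambda a: {k: (1 - a) * p_a[k] + a * p_b[k] for k in p_a}
    return max(loss(mix(a)) - ((1 - a) * la + a * lb) for a in alphas)

theta1, theta2 = init_params(), init_params()
perm = rng.permutation(theta1["W1"].shape[0])
print("barrier(theta1, theta2)       =", barrier(theta1, theta2))
print("barrier(perm(theta1), theta2) =", barrier(permute_hidden(theta1, perm), theta2))
print("loss unchanged by permutation?", np.isclose(loss(theta1), loss(permute_hidden(theta1, perm))))
```

In these terms, weak linear connectivity asks that some permutation drive this barrier to roughly zero for each pair of trained networks, while strong linear connectivity asks that one permutation per network do so against all the other networks at once.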
Notes
- 1. See Appendix A for details on handling other types of layers.
- 2.
- 3. These conditions include sufficient width and the use of layer normalization.
- 4. In fact, a strictly stronger claim is made: for a certain class of networks \(\mathcal{F}\), for every \(\theta_1 \in \mathcal{F}\), there is a single permutation that can be applied to \(\theta_1\) to remove the error barrier between the permuted \(\theta_1\) and any other network in \(\mathcal{F}\). This also means that the networks in \(\mathcal{F}\) are piecewise linearly connected before permuting (see the sketch below).
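For concreteness, the weak and strong claims contrasted in the abstract and in note 4 can be written as follows; the loss-barrier notation \(B\) and the permutation symbols are ours, intended only as a sketch of those statements.

```latex
% Loss barrier along the linear path between two parameter vectors (notation ours):
B(\theta_a, \theta_b) \;=\; \max_{\alpha \in [0,1]}
  L\bigl(\alpha \theta_a + (1-\alpha)\,\theta_b\bigr)
  \;-\; \bigl[\alpha L(\theta_a) + (1-\alpha) L(\theta_b)\bigr].
% Weak linear connectivity: the permutation may depend on the pair.
\forall\, \theta_1, \theta_2 \in \mathcal{F}\ \ \exists\, \pi_{12} :\quad
  B\bigl(\pi_{12}(\theta_1), \theta_2\bigr) \approx 0.
% Strong linear connectivity (note 4): one permutation per network works
% against every other network in \mathcal{F} simultaneously.
\forall\, \theta_1 \in \mathcal{F}\ \ \exists\, \pi_1\ \ \forall\, \theta_2 \in \mathcal{F} :\quad
  B\bigl(\pi_1(\theta_1), \theta_2\bigr) \approx 0.
```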
Acknowledgements
The authors would like to thank Tiffany Vlaar and Utku Evci for feedback on a draft and various ideas, as well as Udbhav Bamba for preliminary implementation work. DMR and DR are supported by Canada CIFAR AI Chairs and NSERC Discovery Grants. The authors also acknowledge material support from NVIDIA in the form of computational resources, and are grateful for technical support from the Mila IDT and Vector teams in maintaining the Mila and Vector Compute Clusters. Resources used to prepare this research were provided, in part, by Mila (mila.quebec), the Vector Institute (vectorinstitute.ai), the Province of Ontario, the Government of Canada through CIFAR, and companies sponsoring the Vector Institute.
Ethics declarations
Disclosure of Interests
The authors have no competing interests to declare that are relevant to the content of this article.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Sharma, E., Kwok, D., Denton, T., Roy, D.M., Rolnick, D., Dziugaite, G.K. (2024). Simultaneous Linear Connectivity of Neural Networks Modulo Permutation. In: Bifet, A., Davis, J., Krilavičius, T., Kull, M., Ntoutsi, E., Žliobaitė, I. (eds) Machine Learning and Knowledge Discovery in Databases. Research Track. ECML PKDD 2024. Lecture Notes in Computer Science, vol 14947. Springer, Cham. https://doi.org/10.1007/978-3-031-70368-3_16
DOI: https://doi.org/10.1007/978-3-031-70368-3_16
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-70367-6
Online ISBN: 978-3-031-70368-3