
One is Not Enough: Parameter-Efficient Fine-Tuning With Multiplicative Sparse Factorization


Abstract:

Fine-tuning gigantic pre-trained models has become a canonical paradigm in natural language processing. Unfortunately, as pre-trained models grow larger, even conventional fine-tuning becomes prohibitively resource-consuming. That motivates the recent surge of parameter-efficient fine-tuning methods, which selectively update only a small portion of model parameters. Existing methods either customize add-on modules (e.g., adapters, prompts), or rely on weight-update decomposition, which requires strong structural assumptions (e.g., sparse or low-rank updates) and ad-hoc pre-defined structure parameters (e.g., layerwise sparsities, or the intrinsic rank). Extending the latter line of work, this paper proposes a new structured weight-decomposition scheme for parameter-efficient fine-tuning that is designed to be (i) flexible, covering a much broader matrix family, with sparse or low-rank matrices as special cases; and (ii) (nearly) hyperparameter-free, requiring only a global parameter budget as input. This new scheme, dubbed AutoSparse, meets the two goals by factorizing each layer's weight update into a product of multiple sparse matrix factors. Notably, the sparsity levels of all those factors are allocated automatically (without any heuristic or ad-hoc tuning) through one holistic budget-constrained optimization, which can be solved by projected gradient descent and plugged painlessly into normal fine-tuning. Extensive experiments and in-depth studies on diverse architectures and tasks ({BERT, RoBERTa, BART}) consistently endorse the superior parameter efficiency of AutoSparse over state-of-the-art methods. For instance, AutoSparse with BERT can operate with only 0.5% trainable parameters while reaching 83.2% accuracy on MNLI-mismatched.
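
To make the scheme concrete, below is a minimal, hypothetical PyTorch sketch of the two ingredients named in the abstract: a per-layer weight update parameterized as a product of several factors, and a projected-gradient step that enforces one global nonzero budget across all factors. The factor shapes, the near-zero initialization, the global top-k magnitude projection, and the names SparseFactorizedDelta and project_onto_global_budget are illustrative assumptions; the abstract does not specify the paper's exact formulation.

    import torch
    import torch.nn as nn

    class SparseFactorizedDelta(nn.Module):
        """Weight update Delta W expressed as a product of several trainable factors.
        (Hypothetical sketch: square factors with small random init are assumed.)"""
        def __init__(self, dim: int, num_factors: int = 3):
            super().__init__()
            self.factors = nn.ParameterList(
                [nn.Parameter(1e-3 * torch.randn(dim, dim)) for _ in range(num_factors)]
            )

        def delta(self) -> torch.Tensor:
            # Multiplicative composition: Delta W = F_1 @ F_2 @ ... @ F_k
            out = self.factors[0]
            for f in list(self.factors)[1:]:
                out = out @ f
            return out

    @torch.no_grad()
    def project_onto_global_budget(delta_modules, budget: int):
        """Keep only the `budget` largest-magnitude entries across ALL factors of
        ALL layers and zero the rest (an assumed global top-k projection standing
        in for the budget-constrained step described in the abstract)."""
        params = [p for m in delta_modules for p in m.factors]
        flat = torch.cat([p.detach().abs().flatten() for p in params])
        if budget >= flat.numel():
            return
        threshold = torch.topk(flat, budget).values.min()
        for p in params:
            p.mul_((p.abs() >= threshold).to(p.dtype))

    # Usage sketch: the frozen pre-trained weight W stays fixed; the effective
    # weight is W + delta(). After each optimizer step, project all factors back
    # onto the global nonzero budget (sizes below are arbitrary placeholders).
    dim, budget = 768, 20_000
    delta_mod = SparseFactorizedDelta(dim)
    opt = torch.optim.AdamW(delta_mod.parameters(), lr=1e-3)

    W = torch.randn(dim, dim)             # stand-in for a frozen pre-trained weight
    x = torch.randn(8, dim)               # dummy batch
    y = x @ (W + delta_mod.delta()).T     # forward pass with the factorized update
    loss = y.pow(2).mean()                # dummy loss
    loss.backward()
    opt.step()
    opt.zero_grad()
    project_onto_global_budget([delta_mod], budget)

In this sketch the projection is applied after each optimizer step, so the per-factor sparsity levels fall out of the single global budget rather than being set layer by layer, mirroring the "one holistic budget-constrained optimization" idea while the rest of training proceeds as in ordinary fine-tuning.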
Published in: IEEE Journal of Selected Topics in Signal Processing (Volume: 18, Issue: 6, September 2024)
Page(s): 1059 - 1069
Date of Publication: 22 July 2024

