The success of modern deep learning systems is built on two cornerstones: massive amounts of annotated training data and advanced computational infrastructure to support large-scale computation. In recent years, the model size of state-of-the-art deep learning systems has rapidly increased, sometimes reaching billions of parameters. Herein we take a close look at this phenomenon and present an empirical study on the scaling effect of model size for self-supervised speech models. In particular, we investigate the quantitative relationship between model size and loss/accuracy performance on speech tasks. First, the power-law scaling property between the number of parameters and the L1 self-supervised loss is verified for speech models. Then the advantage of large speech models in learning effective speech representations is demonstrated on two downstream tasks: i) speaker recognition and ii) phoneme classification. Moreover, it is shown that the model size of self-supervised speech networks can compensate for the lack of annotation when training data are insufficient.
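For orientation, a power-law scaling relationship of the kind referenced above is conventionally written in the following form; this is a generic sketch of the standard parameterization used in neural scaling-law studies, and the symbols (L for the self-supervised loss, N for the number of parameters, and the fitted constants N_c and \alpha) are placeholders rather than values reported in this abstract:

L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha}

Under this form the loss falls off as a power of the parameter count, so the relationship appears as a straight line on a log-log plot of loss versus model size.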
Cite as: Pu, J., Yang, Y., Li, R., Elibol, O., Droppo, J. (2021) Scaling Effect of Self-Supervised Speech Models. Proc. Interspeech 2021, 1084-1088, doi: 10.21437/Interspeech.2021-1935
@inproceedings{pu21_interspeech, author={Jie Pu and Yuguang Yang and Ruirui Li and Oguz Elibol and Jasha Droppo}, title={{Scaling Effect of Self-Supervised Speech Models}}, year=2021, booktitle={Proc. Interspeech 2021}, pages={1084--1088}, doi={10.21437/Interspeech.2021-1935} }