Abstract:
Self-supervised learning has been widely exploited to learn powerful speech representations. The premise of this paper is that these learned self-supervised representations contain information that is irrelevant to a particular downstream task. Hence, we investigate efficient methods to compute reliable representations and discard redundant information for language identification (LID) using a pre-trained multilingual wav2vec 2.0 model. To determine an optimal baseline system, we compare the performance of wav2vec features extracted from different inner layers of the context network. In this approach, the x-vector self-attention LID (XSA-LID) model forms the backbone used to discriminate between languages. We then propose two mechanisms to reduce the irrelevant information in these representations for LID. The first is an attentive squeeze-and-excitation (SE) block that performs dimension-wise scaling; the second is a linear bottleneck (LBN) block that discards irrelevant information through nonlinear dimension reduction. We incorporate these two methods into the XSA-LID model and conduct experiments on the AP19-OLR data and the MLS14 data in NIST LRE 2017. By replacing the original input features with wav2vec 2.0 features, the XSA-LID model achieves a 63.79% relative improvement in average cost on the AP19-OLR data, and 40.42%, 41.54% and 18.97% relative improvements on 3 s, 10 s and 30 s test speech in the MLS14 data in NIST LRE 2017, respectively. In addition, the proposed LBN-XSA model achieves a 9.85% relative improvement on the AP19-OLR data and over 10% overall improvement on the MLS14 data with a modest number of additional parameters compared to the XSA-LID model. Finally, in terms of average cost and accuracy, the proposed LBN-XSA model outperforms the XSA-LID model using fine-tuned features on the AP19-OLR data.
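As a concrete illustration of extracting features from an inner layer of a pre-trained multilingual wav2vec 2.0 context network, the sketch below uses torchaudio's XLSR-53 bundle; the bundle choice, layer index, and dummy input are assumptions for illustration, since the abstract does not specify the extraction toolchain.

    # Minimal sketch: layer-wise wav2vec 2.0 feature extraction.
    # The XLSR-53 bundle, layer index, and dummy waveform are assumptions
    # for exposition; the paper's exact setup may differ.
    import torch
    import torchaudio

    bundle = torchaudio.pipelines.WAV2VEC2_XLSR53   # multilingual wav2vec 2.0
    model = bundle.get_model().eval()

    waveform = torch.randn(1, bundle.sample_rate)   # 1 s of dummy 16 kHz audio
    with torch.inference_mode():
        # extract_features returns one tensor per transformer layer
        features, _ = model.extract_features(waveform)

    layer_feats = features[7]   # pick one inner layer (0-indexed)
    print(len(features), layer_feats.shape)  # 24 layers, (1, frames, 1024)

Comparing LID performance across the entries of this per-layer list is what the layer-selection experiment in the abstract amounts to.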
Published in: IEEE Journal of Selected Topics in Signal Processing (Volume: 16, Issue: 6, October 2022)
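For readers unfamiliar with the two information-reduction mechanisms named in the abstract, the following PyTorch sketch shows a generic squeeze-and-excitation scaling block and a linear bottleneck block of the kind described; the class names, reduction ratios, and layer choices are illustrative assumptions, not the authors' exact XSA-LID architecture.

    # Illustrative sketch of the two information-reduction blocks named in
    # the abstract. Dimensions and layer choices are assumptions for
    # exposition, not the authors' exact architecture.
    import torch
    import torch.nn as nn

    class SEBlock(nn.Module):
        """Squeeze-and-excitation: learn a per-dimension gate that rescales
        each feature dimension, down-weighting task-irrelevant ones."""
        def __init__(self, dim: int, reduction: int = 8):
            super().__init__()
            self.gate = nn.Sequential(
                nn.Linear(dim, dim // reduction),  # squeeze
                nn.ReLU(),
                nn.Linear(dim // reduction, dim),  # excite
                nn.Sigmoid(),                      # per-dimension weights in (0, 1)
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, time, dim); pool over time, then rescale dimensions
            s = x.mean(dim=1)            # (batch, dim) average over frames
            w = self.gate(s)             # (batch, dim) dimension-wise scales
            return x * w.unsqueeze(1)    # broadcast scaling over time

    class LinearBottleneck(nn.Module):
        """Linear bottleneck: project to a lower dimension with a
        nonlinearity, discarding information that does not survive
        the narrow layer."""
        def __init__(self, dim: int, bottleneck: int = 128):
            super().__init__()
            self.down = nn.Linear(dim, bottleneck)
            self.act = nn.ReLU()
            self.up = nn.Linear(bottleneck, dim)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.up(self.act(self.down(x)))

    if __name__ == "__main__":
        feats = torch.randn(4, 200, 1024)   # (batch, frames, wav2vec dim)
        out = LinearBottleneck(1024)(SEBlock(1024)(feats))
        print(out.shape)                    # torch.Size([4, 200, 1024])

In the paper's setting these blocks precede the XSA-LID backbone; here they are shown standalone on a batch of frame-level features to make the dimension-wise scaling and nonlinear dimension reduction explicit.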