Residual convolutional neural network with attentive feature pooling for end-to-end language identification from short-duration speech
Introduction
Identifying the spoken language in speech data of unconstrained phonetic content is highly useful across several speech processing applications. For instance, language recognizers can be employed for hierarchical modelling, where a language recognizer is placed in the early stages of the pipeline, followed by language-specific sub-modules, as illustrated in Fig. 1. Speech recognition and speaker verification are examples in which prior information about the spoken language can boost performance. Language recognizers also have direct practical applications, such as routing calls in call centers.
Commonly, language identification (LID), i.e. identifying the spoken language in a given speech sample under the assumption that a single language is present, is tackled with approaches similar to those used for speaker verification/recognition (Dehak et al., 2011b). Classical approaches for automatic speaker or language recognition divide the problem into two distinct phases: (i) compute low-dimensional representations of input speech data; and (ii) perform binary classification on top of the pre-computed representations of enrollment and test utterances. So-called i-vectors (Dehak et al., 2011a) are known to be the state of the art for speaker/language recognition, along with more recent methods such as x-vectors (Snyder et al., 2017). Generally, i-vectors are obtained by first computing a universal background model, commonly a Gaussian mixture model, followed by factor analysis on top of statistics of the latents, with the aim of obtaining a low-dimensional representation that embeds both channel- and speaker-dependent information. Classification is performed with probabilistic linear discriminant analysis (PLDA) (Prince and Elder, 2007). However, the i-vector+PLDA framework presents known shortcomings, such as its lack of robustness to short-duration test recordings.
As in several other fields, neural networks have in recent years been widely applied to substitute components of speaker/language recognition frameworks, either generating alternative low-dimensional embeddings or performing recognition in an end-to-end fashion, thus eliminating the need for a post-trained binary classifier. One such example is x-vectors, which leverage feed-forward neural networks operating at different time scales to compute low-dimensional embeddings from utterances of varying lengths and phonetic content. In this approach, a frame-level model consisting of a 5-layer feed-forward neural network is first applied, with context provided by adding neighboring frames as inputs. Frame-level model outputs are then aggregated by a so-called statistics pooling layer. Concatenated first- and second-order statistics of the frame-level outputs are then used as input to a segment-level model, implemented as a 2-layer feed-forward neural network followed by a softmax output layer.
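As a concrete illustration, the statistics pooling step that bridges the frame- and segment-level models can be sketched as below (a minimal numpy sketch; the function name and the dimensions are ours, not the x-vector recipe's):

```python
import numpy as np

def statistics_pooling(frame_outputs):
    """Aggregate variable-length frame-level outputs (T x D) into a fixed
    2D-dimensional segment representation by concatenating per-dimension
    mean and standard deviation over time."""
    mu = frame_outputs.mean(axis=0)     # first-order statistics, shape (D,)
    sigma = frame_outputs.std(axis=0)   # second-order statistics, shape (D,)
    return np.concatenate([mu, sigma])  # shape (2D,)

# Example: 120 frames of 512-dimensional frame-level outputs
frames = np.random.randn(120, 512)
segment_input = statistics_pooling(frames)  # shape (1024,)
```

Note that pooling makes the segment representation independent of utterance length and invariant to the ordering of frames.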
Follow-up approaches, in turn, have extended the idea of including context by employing convolutional neural networks across time (Bhattacharya et al., 2017; Chung et al., 2018), i.e. performing 2-dimensional convolutions over time-frequency representations of speech, so that full time-dependent information is taken into account when computing low-dimensional representations, rather than only short-term time dependencies being modelled through contextual frames, as is the case for x-vectors.
Training of x-vector and convolutional neural-network-based systems has generally been performed in the multi-class classification setting, i.e. the model is used as a classifier that identifies the speaker/language of a given utterance. The outputs of a final softmax layer thus parameterize a conditional multinoulli distribution over speakers/languages, and parameters are learned via maximum likelihood estimation by minimizing the cross-entropy loss. At test time, under open-set conditions, outputs of intermediate layers are used as low-dimensional representations on top of which a binary classifier can be trained for verification, while under closed-set conditions the model outputs are used directly as scores.
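The maximum likelihood criterion described above amounts to the following computation, where the softmax outputs define the multinoulli distribution over languages and the loss is the negative log-likelihood of the true label (a minimal numpy sketch; the logits and label are illustrative):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(logits, label):
    """Negative log-likelihood of the true language under the multinoulli
    distribution parameterized by the softmax outputs."""
    return -np.log(softmax(logits)[label])

logits = np.array([2.0, 0.5, -1.0])    # model scores for 3 candidate languages
loss = cross_entropy(logits, label=0)  # quantity minimized during training
```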
In this work, we propose the use of a residual convolutional neural network for language identification. Two mechanisms allow long-term language dependencies to be modelled: (i) convolutions in the time dimension, and (ii) a self-attention layer (Raffel and Ellis, 2015) which weighs last-layer time-steps for weighted statistics pooling. Furthermore, training is tailored to enforce language dependency in the model outputs: triplet loss minimization is performed at train time along with the maximum likelihood criterion described above. As will be further discussed, triplet loss minimization pulls representations of examples from the same language close together, while pushing representations from different languages far apart. We evaluate the introduced model and training scheme on a dataset containing telephone speech recordings in ten oriental languages under different settings, including short-duration speech and confusable languages, showing relevant improvements in classification performance over strong baselines in all studied test conditions. Moreover, an end-to-end evaluation shows that directly using model outputs as scores, i.e. discarding the post-trained PLDA, outperforms the i-vector+PLDA results.
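The weighted statistics pooling mechanism can be sketched as follows: a scoring function assigns each last-layer time-step a weight via a softmax over time, and attention-weighted first- and second-order statistics are pooled. This is a simplified numpy sketch in the spirit of Raffel and Ellis (2015); the linear scoring function and parameter shapes are our assumptions, not the exact parameterization used in the paper:

```python
import numpy as np

def attentive_stats_pooling(H, w, b=0.0):
    """H: (T, D) last-layer outputs; w: (D,) attention parameters (assumed).
    Returns attention-weighted mean and std concatenated, shape (2D,)."""
    scores = H @ w + b                     # unnormalized relevance per time-step
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()            # softmax over time-steps
    mu = (alpha[:, None] * H).sum(axis=0)  # attention-weighted mean
    var = (alpha[:, None] * (H - mu) ** 2).sum(axis=0)
    return np.concatenate([mu, np.sqrt(var + 1e-8)])
```

When all time-steps receive equal scores, the weights become uniform and the layer reduces to plain statistics pooling, so the attention can only help the model emphasize informative segments.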
In summary, the main contributions of this work are:
1. We propose a particular convolutional architecture composed of residual blocks along with self-attention mechanisms to model long-term dependencies for learning low-dimensional representations of language.
2. A particular training scheme is devised by making use of metric learning methods, i.e. triplet loss minimization, with the goal of enforcing class-separability in the representation space.
3. Evaluation is performed across different scenarios including short speech durations, as well as open-set conditions, in which test utterances from non-target languages not represented in the training data are included.
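The triplet objective used in the training scheme above can be sketched as follows (a minimal numpy version of the standard hinge-based triplet loss over squared Euclidean distances; the margin value is illustrative, not the setting used in the paper):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Anchor and positive are representations of utterances in the same
    language; negative comes from a different language. The loss is zero
    once the anchor-negative distance exceeds the anchor-positive
    distance by at least `margin`."""
    d_ap = np.sum((anchor - positive) ** 2)  # same-language distance
    d_an = np.sum((anchor - negative) ** 2)  # cross-language distance
    return max(0.0, d_ap - d_an + margin)
```

Minimizing this quantity pulls same-language representations together and pushes different-language representations apart, complementing the cross-entropy objective.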
The remainder of this paper is organized as follows: connections with previous literature are discussed in Section 2. Background material is presented in Section 3. Our proposed approach including model description and training scheme is shown in Section 4. Experimental setup along with results and discussion appear in Section 5, while conclusions are drawn in Section 6.
Related work
As pointed out in Dehak et al. (2011b), classical approaches for language identification commonly rely on methods originally introduced for speaker recognition. This is the case for the so-called i-vectors, introduced for fixed-dimensional speaker modeling in Dehak et al. (2011a). Applications of i-vectors in language recognition can be found in Dehak et al. (2011b) and Chung et al. (2018). i-vectors are obtained by first training what is usually referred to as a universal background model,
Residual learning
Residual architectures have featured in several recent relevant results obtained with convolutional neural networks. First introduced in He et al. (2016), ResNets constitute a set of architectures made up of a series of so-called residual blocks, which determine how a feature transformation should differ from the identity, rather than how it should differ from zero (Bartlett et al., 2018). A residual block transformation, for a generic input X and a mapping F, presents a basic form given by:

Y = F(X) + X
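The basic residual transformation described above can be sketched with dense layers standing in for the block's convolutional weights (a simplified illustration; actual ResNet blocks stack convolutions, batch normalization, and ReLU activations):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(X, W1, W2):
    """Basic residual form Y = F(X) + X: the block parameterizes a
    deviation F from the identity rather than the whole mapping."""
    F = relu(X @ W1) @ W2  # residual branch (stand-in for conv layers)
    return F + X           # skip connection adds the identity path

# With zero weights, F vanishes and the block is exactly the identity
X = np.random.randn(5, 8)
Y = residual_block(X, np.zeros((8, 8)), np.zeros((8, 8)))  # Y == X
```

This parameterization is what makes very deep stacks trainable: a block can default to the identity and only learn a correction on top of it.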
Proposed model
As mentioned previously, we propose a convolutional architecture aiming to include long-term contextual information at each time-step, an inherent feature of stacked convolutional layers (Goodfellow et al., 2016). It is important to highlight that, unlike other approaches that employ causal convolutions for temporal dependency modelling (Bai et al., 2018), the setting explored herein assumes access to the full speech recording for computation of each output
Results and discussion
We evaluate our proposed framework using the dataset introduced for the AP18-OLR Challenge (Tang et al., 2018), which consists of telephone speech recordings with unconstrained phonetic content in 10 different oriental languages. Information about speaker identity, gender, or age was not utilized, nor was phonetic information. Dataset details are summarized in Table 1.
The AP18-OLR database is divided into three subsets, namely train, development and evaluation sets. We further introduce
Conclusion
In this work, we evaluated the effectiveness of residual convolutional neural networks in modelling language dependencies from speech, with the goal of providing the deep layers of the model with contextual information from distant time-steps. The model, a ResNet-50, employs so-called residual blocks with skip connections, which were previously shown to improve loss landscape conditioning. A self-attention block is employed for temporal pooling, weighing representations in
Acknowledgment
The authors wish to acknowledge funding from the National Research Council of Canada (NRC) through the Canadian Indigenous Languages Technology project under contract #909859, and from the Natural Sciences and Engineering Research Council of Canada (NSERC) through grants RGPIN-2016-4175 and RGPAS-493010-2016. Any opinions, findings, conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the NRC or NSERC.
References (34)
- Li, H., Xu, Z., Taylor, G., Goldstein, T., 2017. Visualizing the loss landscape of neural nets. arXiv:...
- van der Maaten, L., Hinton, G., 2008. Visualizing data using t-SNE. J. Mach. Learn. Res.
- Tandem features for text-dependent speaker verification on the RedDots corpus. Proceedings of the INTERSPEECH, 2016.
- Bai, S., Kolter, J.Z., Koltun, V., 2018. An empirical evaluation of generic convolutional and recurrent networks for...
- Bartlett, P.L., Helmbold, D.P., Long, P.M., 2018. Gradient descent with identity initialization efficiently learns...
- Bhattacharya et al., 2017. Deep speaker embeddings for short-duration speaker verification. Proceedings of the Interspeech 2017.
- Cai, W., Cai, Z., Liu, W., Wang, X., Li, M., 2018a. Insights into end-to-end learning scheme for language...
- Cai, W., Chen, J., Li, M., 2018b. Exploring the encoding layer and loss function in end-to-end speaker and language...
- Chung, J.S., Nagrani, A., Zisserman, A., 2018. VoxCeleb2: deep speaker recognition. Proceedings of the INTERSPEECH.
- Dehak et al., 2011a. Front-end factor analysis for speaker verification. IEEE Trans. Audio Speech Lang. Process.
- Dehak et al., 2011b. Language recognition via i-vectors and dimensionality reduction. Proceedings of the Twelfth Annual Conference of the International Speech Communication Association.
- Goodfellow et al., 2016. Deep Learning.
- He et al., 2016. Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
This paper has been recommended for acceptance by Roger K. Moore.