Neurocomputing

Volume 412, 28 October 2020, Pages 177-186

Dilated residual networks with multi-level attention for speaker verification

https://doi.org/10.1016/j.neucom.2020.06.079

Abstract

With the development of deep learning techniques, speaker verification (SV) systems based on deep neural networks (DNNs) achieve competitive performance compared with traditional i-vector-based approaches. Previous DNN-based SV methods usually employ time-delay neural networks, which limit the extension of the network toward an effective representation. Besides, existing attention mechanisms in DNN-based SV systems are applied only at a single level of the network architecture, leading to insufficient extraction of important features.

To address the above issues, we propose an effective deep speaker embedding architecture for SV, which combines residual connections of one-dimensional dilated convolutional layers, called dilated residual networks (DRNs), with a multi-level attention model. The DRNs can not only capture long time-frequency context information of features, but also exploit information from multiple layers of the DNN. In addition, the multi-level attention model, which consists of two-dimensional convolutional block attention modules employed at the frame level and a vector-based attention mechanism utilized at the pooling layer, can emphasize important features at multiple levels of the DNN. Experiments conducted on the NIST SRE 2016 dataset show that the proposed architecture achieves a superior equal error rate (EER) of 7.094% and a better detection cost function (DCF16) score of 0.552 compared with state-of-the-art methods. Furthermore, ablation experiments demonstrate the effectiveness of dilated convolutions and the multi-level attention on SV tasks.

Introduction

Speaker verification (SV) is a common speech-related research topic that aims to determine whether a given speech utterance belongs to a specific speaker; it can be applied in various fields, such as criminal investigation, judicial forensics and telephone identification. Depending on the restrictions placed on the recorded text of the utterances, SV can generally be classified into two categories: text-dependent SV and text-independent SV. Since text-independent SV is more challenging and has greater practical significance [1], [2], this work focuses on text-independent SV, which places no phonetic restrictions on the recorded text.

Over the past decade, the i-vector representation [3] with a probabilistic linear discriminant analysis (PLDA) backend classifier [4] has dominated the SV area. It consists of a Gaussian mixture model-universal background model (GMM-UBM) to collect statistics and a large projection matrix that maps the high-dimensional statistics into a low-dimensional representation.

Recently, with the development of deep learning techniques, many works have introduced deep neural networks (DNNs) into SV and achieved competitive performance compared with traditional i-vector-based approaches. In detail, DNN-based SV methods exploit a deep speaker learning architecture instead of the i-vector model to extract speaker representations. The process can be divided into three phases [5], as shown in Fig. 1, and is also adopted in this work (a simplified code sketch follows the list):

  • Training phase: First, the utterances in the training dataset are cut into segments of 2 to 4 s as training samples for the DNN. Acoustic features are then extracted from the samples, e.g., mel frequency cepstral coefficients (MFCC), linear frequency cepstral coefficients (LFCC) and enhanced Teager energy cepstral coefficients (ETECC) [6]. Finally, the DNN is trained on the acoustic features with a loss function to discriminate between speakers.

  • Enrollment phase: The acoustic features of the full utterances in the training dataset are fed into the trained DNN to extract speaker-dependent bottleneck features, known as speaker embeddings. The speaker embeddings of the enrolled speaker dataset are used to train a separate PLDA classifier for the subsequent test phase.

  • Test phase: The test utterances are evaluated with the DNN embeddings and the PLDA classifier.
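
As a rough, self-contained illustration of these three phases (not the authors' code), the sketch below substitutes a random projection for the trained DNN and cosine scoring for the PLDA backend; all names and dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def extract_embedding(weights: np.ndarray, features: np.ndarray) -> np.ndarray:
    """Average frame-level projections into one utterance-level embedding."""
    return (features @ weights).mean(axis=0)

# Training phase (stand-in): a random projection replaces the DNN, which in
# practice is trained with a speaker-classification loss on 2-4 s chunks.
weights = rng.standard_normal((24, 128))  # 24-dim features -> 128-dim embedding

# Enrollment phase: embed a full enrollment utterance for a speaker.
enroll_feats = rng.standard_normal((500, 24))   # (frames, feature dims)
enroll_emb = extract_embedding(weights, enroll_feats)

# Test phase: embed the test utterance and score the trial. Cosine scoring
# stands in here for the PLDA backend described above.
test_feats = rng.standard_normal((400, 24))
test_emb = extract_embedding(weights, test_feats)
score = np.dot(enroll_emb, test_emb) / (
    np.linalg.norm(enroll_emb) * np.linalg.norm(test_emb))
print(f"cosine score: {score:.3f}")
```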

In DNN-based SV systems, it is essential to construct an efficient deep embedding learning architecture for discriminating between speakers. Such an architecture can be built with a time-delay neural network (TDNN) [7], [8], a convolutional neural network (CNN) [9], [10] or a long short-term memory network (LSTM) [11]. Among these methods, the x-vector [8], which uses a TDNN architecture and outperforms the traditional i-vector, has become the state-of-the-art DNN-based method in current SV work.

The x-vector architecture contains three typical parts: five TDNN layers for frame-level processing, a statistics pooling layer that aggregates the frame-level outputs across all frames, and two fully-connected (FC) layers for utterance-level processing. Although the x-vector shows great performance on SV, its TDNN layers limit further improvement. The works in [12], [13] show that exploiting information from multiple layers is beneficial for an effective representation, since there is a gradual transition from low to high layers in a DNN. However, concatenating the outputs of the five high-dimensional TDNN layers in the x-vector (512, 512, 512, 512, and 1500 dimensions, respectively) drastically increases computational complexity. In addition, the frequency dimension of the input acoustic features is typically only 20 to 30, leading to a huge difference in dimensionality between the input and the first TDNN layer and resulting in a loss of the corresponding feature information.
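
For concreteness, the x-vector topology described above can be sketched in PyTorch as follows. The layer widths (512, 512, 512, 512, 1500) follow the text; the kernel sizes and dilations are a common approximation of the TDNN frame contexts and may differ from the original recipe.

```python
import torch
import torch.nn as nn

class XVector(nn.Module):
    def __init__(self, feat_dim: int = 24, emb_dim: int = 512, n_speakers: int = 1000):
        super().__init__()
        # TDNN layers realized as 1-D convolutions over time with growing dilation.
        self.frame_layers = nn.Sequential(
            nn.Conv1d(feat_dim, 512, kernel_size=5, dilation=1), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=2), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=3), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=1), nn.ReLU(),
            nn.Conv1d(512, 1500, kernel_size=1), nn.ReLU(),
        )
        self.fc1 = nn.Linear(2 * 1500, emb_dim)    # embedding ("x-vector") layer
        self.fc2 = nn.Linear(emb_dim, emb_dim)
        self.out = nn.Linear(emb_dim, n_speakers)  # training-time classifier

    def forward(self, x):                # x: (batch, feat_dim, frames)
        h = self.frame_layers(x)
        # Statistics pooling: concatenate mean and std across all frames.
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)
        emb = self.fc1(stats)            # extracted as the speaker embedding
        return self.out(torch.relu(self.fc2(torch.relu(emb))))

logits = XVector()(torch.randn(8, 24, 300))  # 8 chunks of 300 frames each
```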

In recent years, to make networks focus on important features, attention mechanisms have gradually been introduced into DNN-based SV systems and applied at different levels of the DNN architecture, such as the pooling layer [14], [15], [16], [17], the frame level [1] and the FC layer [18]. However, those works employ the attention mechanism at only a single level of the DNN and emphasize important features along a single dimension, ignoring the impact of multiple levels and other feature dimensions.
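
As an example of pooling-level attention in the spirit of [14], [15], [16], [17], the sketch below scores each frame with a small network and uses the normalized scores to weight the mean and standard deviation; this is a generic illustration, not any cited paper's exact implementation.

```python
import torch
import torch.nn as nn

class AttentiveStatsPooling(nn.Module):
    def __init__(self, in_dim: int, att_dim: int = 128):
        super().__init__()
        # A small scoring network produces one attention logit per frame.
        self.attention = nn.Sequential(
            nn.Conv1d(in_dim, att_dim, kernel_size=1), nn.Tanh(),
            nn.Conv1d(att_dim, 1, kernel_size=1),
        )

    def forward(self, h):                            # h: (batch, channels, frames)
        w = torch.softmax(self.attention(h), dim=2)  # per-frame weights, sum to 1
        mean = (w * h).sum(dim=2)
        var = (w * h.pow(2)).sum(dim=2) - mean.pow(2)
        std = var.clamp(min=1e-8).sqrt()
        return torch.cat([mean, std], dim=1)         # (batch, 2 * channels)

pooled = AttentiveStatsPooling(1500)(torch.randn(8, 1500, 300))
```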

Based on the above observations, we propose an effective deep embedding learning architecture called Dilated Residual Networks with Multi-level Attention (DRES-MA for short), which combines dilated residual networks with a multi-level attention model for SV tasks. Inspired by the traditional x-vector, we utilize one-dimensional dilated convolutional layers to capture long time-frequency context information for frame-level features. To exploit information from multiple layers, residual connections [19] are employed to build the dilated residual networks. Besides, a multi-level attention model is proposed to focus on important features at multiple network levels, including the frame level and the pooling layer. At the frame level, we propose a two-dimensional convolutional block attention module (CBAM2D) to capture important features along the temporal and channel axes. At the pooling layer, we introduce a vector-based attention (VA) mechanism and incorporate it into a weighted statistics pooling layer, called the VA-based pooling layer. The CBAM2D and the VA-based pooling layer together form the multi-level attention model, applying attention mechanisms at multiple network levels. Experiments conducted on the NIST SRE 2016 dataset [20] verify the effectiveness of the proposed architecture. The DRES-MA architecture achieves an equal error rate (EER) of 7.094% and a detection cost function 2016 (DCF16) score of 0.552. Compared to the x-vector baseline, DRES-MA improves the EER by 17% and the DCF16 score by 9% with little increase in computational cost.
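
A minimal sketch of the kind of one-dimensional dilated residual block that DRES-MA stacks at the frame level is given below; the channel width, kernel size and normalization placement are assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class DilatedResBlock1D(nn.Module):
    def __init__(self, channels: int, dilation: int):
        super().__init__()
        pad = dilation  # keeps the frame length unchanged for kernel_size=3
        self.body = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=3, dilation=dilation, padding=pad),
            nn.BatchNorm1d(channels), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=3, dilation=dilation, padding=pad),
            nn.BatchNorm1d(channels),
        )

    def forward(self, x):               # x: (batch, channels, frames)
        # The residual connection lets later layers reuse earlier features,
        # while dilation widens the temporal context without extra parameters.
        return torch.relu(x + self.body(x))

# Stacking blocks with growing dilation expands the receptive field quickly.
net = nn.Sequential(*[DilatedResBlock1D(256, d) for d in (1, 2, 4, 8)])
out = net(torch.randn(8, 256, 300))
```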

The rest of this paper is organized as follows. Section 2 introduces related work on DNN-based SV methods. Section 3 describes the network construction of the proposed DRES-MA architecture. Section 4 presents the experimental setup. Experimental results and analysis of the proposed techniques are presented in Section 5. Section 6 concludes this work.

Section snippets

Related work

In recent years, with the development of deep learning and data augmentation techniques, DNN-based SV methods have become popular and achieved competitive performance compared with the traditional i-vector method. Previous works mainly improved the performance of DNN-based SV methods from three aspects: network architecture, loss function and attention mechanism.

For network architecture, the x-vector proposed by Snyder et al. [8] utilized a TDNN to achieve superior performance compared with the

Overview

In this section, we propose a new deep speaker embedding architecture called Dilated Residual Networks with Multi-level Attention (DRES-MA for short), shown in Fig. 2. The proposed architecture consists of three parts: frame-level part, pooling layer and utterance-level part. In the frame-level part, we utilize 16 dilated residual blocks combined with two-dimensional convolutional block attention modules (CBAM2D) to deal with the frame-level features. At the pooling layer, the vector-based
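
The CBAM2D module mentioned above, which attends along the channel and temporal axes of the frame-level feature maps, can be illustrated by the simplified CBAM-style stand-in below; the reduction ratio and kernel size are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class ChannelTemporalAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        # Shared MLP for channel attention (squeeze-and-excite style).
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels),
        )
        # 1-D convolution over per-frame statistics for temporal attention.
        self.temporal_conv = nn.Conv1d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):                         # x: (batch, channels, frames)
        # Channel attention from average- and max-pooled frame statistics.
        avg = self.channel_mlp(x.mean(dim=2))
        mx = self.channel_mlp(x.max(dim=2).values)
        x = x * torch.sigmoid(avg + mx).unsqueeze(2)
        # Temporal attention from per-frame average and max over channels.
        t = torch.cat([x.mean(dim=1, keepdim=True),
                       x.max(dim=1, keepdim=True).values], dim=1)
        return x * torch.sigmoid(self.temporal_conv(t))

refined = ChannelTemporalAttention(256)(torch.randn(8, 256, 300))
```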

Dataset

The training dataset consists of English telephone speech collected from NIST speaker recognition evaluations (SRE) [34] and the Switchboard datasets [35]. The test dataset is the NIST 2016 speaker recognition evaluation (SRE16) [20]. The SRE training data consist of NIST SRE data from 2005, 2006 [36] and 2008 [37], while the Switchboard datasets are made up of Switchboard 2 Phase 1, 2 and 3 and Switchboard Cellular, which together contain about 28k recordings from 2.6k speakers. In total, there are 4343

Overall results

The experimental results of the various systems are summarized in Table 3. As described in Section 4.1, the SRE16 test set covers two languages: Tagalog and Cantonese. Hence, the test set is divided into three parts: Tagalog, Cantonese and the combination of the two languages (All). Except for the EER on Tagalog, the DRES-MA system achieves better performance than the three baseline systems. In particular, it achieves the best EER of 7.094% and the best DCF16 score of 0.552 on the SRE16 All set.
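
For reference, the EER is the operating point at which the false-accept and false-reject rates are equal. The sketch below shows one way to compute it from trial scores by sweeping a decision threshold; the scores here are synthetic, for illustration only.

```python
import numpy as np

def equal_error_rate(scores: np.ndarray, labels: np.ndarray) -> float:
    """labels: 1 for target (same-speaker) trials, 0 for non-target trials."""
    order = np.argsort(scores)[::-1]          # sort trials by descending score
    labels = labels[order]
    # Cumulative false-accept / false-reject rates as the threshold sweeps down.
    fa = np.cumsum(1 - labels) / max((1 - labels).sum(), 1)
    fr = 1.0 - np.cumsum(labels) / max(labels.sum(), 1)
    idx = np.argmin(np.abs(fa - fr))          # point where the two rates cross
    return float((fa[idx] + fr[idx]) / 2)

rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(1.0, 1.0, 1000),    # target trials
                         rng.normal(-1.0, 1.0, 1000)])  # non-target trials
labels = np.concatenate([np.ones(1000), np.zeros(1000)])
print(f"EER: {100 * equal_error_rate(scores, labels):.2f}%")
```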

Conclusion

In this paper, an effective deep embedding architecture, DRES-MA, is proposed for text-independent speaker verification, which combines dilated residual networks with a multi-level attention model. The dilated residual networks not only capture long time-frequency context information of frame-level features, but also exploit information from multiple layers. In addition, the multi-level attention model can emphasize important features at multiple levels of the DNN architecture. Experimental results

CRediT authorship contribution statement

Yanfeng Wu: Conceptualization, Methodology, Software, Writing - original draft. Chenkai Guo: Conceptualization, Methodology, Writing - review & editing. Hongcan Gao: Writing - review & editing, Resources. Jing Xu: Supervision, Writing - review & editing. Guangdong Bai: Data curation.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

This work is supported by the Science and Technology Planning Project of Tianjin, China (Grant No. 18ZXZNGX00310), the Tianjin Natural Science Foundation (Grant Nos. 17JCZDJC30700 and 19JCQNJC00300), and the Fundamental Research Funds for the Central Universities of Nankai University (Grant No. 6319140).

References (43)

  • T. Bian et al., Self-attention based speaker recognition using cluster-range loss, Neurocomputing (2019)
  • S. Novoselov, A. Shulipa, I. Kremnev, A. Kozlov, V. Shchemelinin, On deep speaker embeddings for text-independent...
  • N. Dehak et al., Front-end factor analysis for speaker verification, IEEE Trans. Audio, Speech, Language Process. (2011)
  • S. Ioffe, Probabilistic linear discriminant analysis
  • E. Variani, X. Lei, E. McDermott, I.L. Moreno, J. Gonzalez-Dominguez, Deep neural networks for small footprint...
  • R. Acharya, H.A. Patil, H. Kotta, Novel enhanced teager energy based cepstral coefficients for replay spoof detection,...
  • D. Snyder, D. Garcia-Romero, D. Povey, S. Khudanpur, Deep neural network embeddings for text-independent speaker...
  • D. Snyder et al., X-vectors: Robust DNN embeddings for speaker recognition
  • A. Torfi, J. Dawson, N.M. Nasrabadi, Text-independent speaker verification using 3d convolutional neural networks, in:...
  • Z. Gao, Y. Song, I. McLoughlin, W. Guo, L. Dai, An improved deep embedding learning method for short duration speaker...
  • L. Wan et al., Generalized end-to-end loss for speaker verification
  • J. Yosinski et al., How transferable are features in deep neural networks?
  • Y. Jiang, Y. Song, I. McLoughlin, Z. Gao, L.-R. Dai, An effective deep embedding learning architecture for speaker...
  • K. Okabe, T. Koshinaka, K. Shinoda, Attentive statistics pooling for deep speaker embedding, in: Proc. Interspeech...
  • Y. Zhu, T. Ko, D. Snyder, B. Mak, D. Povey, Self-attentive speaker embeddings for text-independent speaker...
  • G. Bhattacharya, M.J. Alam, V. Gupta, P. Kenny, Deeply fused speaker embeddings for text-independent speaker...
  • Y. Liu, L. He, W. Liu, J. Liu, Exploring a unified attention-based pooling framework for speaker verification, in: 2018...
  • G. Bhattacharya, J. Alam, P. Kenny, Deep speaker embeddings for short-duration speaker verification, in: Proc....
  • K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: The IEEE Conference on Computer...
  • S.O. Sadjadi, T. Kheyrkhah, A. Tong, C. Greenberg, D. Reynolds, E. Singer, L. Mason, J. Hernandez-Cordero, The 2016...
  • Y. Tang et al., Deep speaker embedding learning with multi-level pooling for text-independent speaker verification


Yanfeng Wu is a Ph.D. candidate in the College of Artificial Intelligence of Nankai University. He received the B.E. degree from Nankai University in 2017. His research interests include speaker recognition and sound classification.

Chenkai Guo received the Ph.D. degree from Nankai University in 2017. He is an assistant professor in the College of Computer Science, Nankai University. His research interests include software analysis on mobile apps, information security, and intelligent software engineering.

Hongcan Gao is a Ph.D. candidate in the College of Computer Science of Nankai University. She received the M.S. degree from Hebei University of Technology in 2017. Her research interests include software analysis on mobile apps and software security.

Jing Xu received the Ph.D. degree from Nankai University in 2003. She is a professor in the College of Artificial Intelligence, Nankai University. She received the Second Prize of the Tianjin Science and Technology Progress Award twice, in 2017 and 2018. Currently, her research interests include intelligent software engineering and medical data analysis.

Guangdong Bai received the bachelor's and master's degrees in computing science from Peking University, China, in 2008 and 2011, respectively, and the Ph.D. degree in computing science from the National University of Singapore in 2015. He is now a Senior Lecturer at the University of Queensland. His research interests include cyber security, software engineering and machine learning.

1 Equal contribution.
