Neurocomputing

Volume 412, 28 October 2020, Pages 177-186

Dilated residual networks with multi-level attention for speaker verification

https://doi.org/10.1016/j.neucom.2020.06.079

Abstract

With the development of deep learning techniques, speaker verification (SV) systems based on deep neural networks (DNNs) achieve competitive performance compared with traditional i-vector-based approaches. Previous DNN-based SV methods usually employ time-delay neural networks, which limit the extension of the network toward an effective representation. Besides, existing attention mechanisms in DNN-based SV systems are applied only at a single level of the network architecture, leading to insufficient extraction of important features.

To address the above issues, we propose an effective deep speaker embedding architecture for SV, which combines residual connections of one-dimensional dilated convolutional layers, called dilated residual networks (DRNs), with a multi-level attention model. The DRNs can not only capture long time-frequency context information of features, but also exploit information from multiple layers of the DNN. In addition, the multi-level attention model, which consists of two-dimensional convolutional block attention modules employed at the frame level and a vector-based attention mechanism utilized at the pooling layer, can emphasize important features at multiple levels of the DNN. Experiments conducted on the NIST SRE 2016 dataset show that the proposed architecture achieves a superior equal error rate (EER) of 7.094% and a better detection cost function (DCF16) score of 0.552 compared with state-of-the-art methods. Furthermore, ablation experiments demonstrate the effectiveness of dilated convolutions and the multi-level attention on SV tasks.

Introduction

Speaker verification (SV) is a common speech-related research topic that aims to determine whether a given speech utterance belongs to a specific speaker; it can be applied in various fields, such as criminal investigation, judicial forensics and telephone identification. Depending on the restrictions placed on the recorded text of the utterances, SV can generally be classified into two categories: text-dependent SV and text-independent SV. Since text-independent SV is more challenging and has greater practical significance [1], [2], this work focuses on text-independent SV, which places no phonetic restrictions on the recorded text.

Over the past decade, the i-vector representation [3] with a probabilistic linear discriminant analysis (PLDA) backend classifier [4] has dominated the SV area. It consists of a Gaussian mixture model-universal background model (GMM-UBM) to collect statistics and a large projection matrix that maps the high-dimensional statistics into a low-dimensional representation.

Recently, with the development of deep learning techniques, many works have introduced deep neural networks (DNNs) into SV and achieved competitive performance compared with traditional i-vector-based approaches. In detail, DNN-based SV methods exploit a deep speaker learning architecture instead of the i-vector model to extract speaker representations. The process can be divided into three phases [5], as shown in Fig. 1, and is also adopted in this work (a simplified code sketch follows the list):

  • Training phase: First, the utterances in the training dataset are cut into segments of 2 to 4 s as training samples for the DNN. Acoustic features are then extracted from the samples, e.g., mel frequency cepstral coefficients (MFCC), linear frequency cepstral coefficients (LFCC) and enhanced Teager energy cepstral coefficients (ETECC) [6]. Finally, the DNN is trained on the acoustic features with a loss function to discriminate between speakers.

  • Enrollment phase: The acoustic features of the full utterances in the training dataset are fed into the trained DNN to extract speaker-dependent bottleneck features, known as speaker embeddings. The speaker embeddings of the enrolled speaker dataset are used to train a separate PLDA classifier for the subsequent test phase.

  • Test phase: The test utterances are evaluated with the DNN embeddings and the PLDA classifier.
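
As a rough, self-contained illustration of these three phases (not the authors' code), the sketch below substitutes a random projection for the trained DNN and cosine scoring for the PLDA backend; all names and dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def extract_embedding(weights: np.ndarray, features: np.ndarray) -> np.ndarray:
    """Average frame-level projections into one utterance-level embedding."""
    return (features @ weights).mean(axis=0)

# Training phase (stand-in): a random projection replaces the DNN, which in
# practice is trained with a speaker-classification loss on 2-4 s chunks.
weights = rng.standard_normal((24, 128))  # 24-dim features -> 128-dim embedding

# Enrollment phase: embed a full enrollment utterance for a speaker.
enroll_feats = rng.standard_normal((500, 24))   # (frames, feature dims)
enroll_emb = extract_embedding(weights, enroll_feats)

# Test phase: embed the test utterance and score the trial. Cosine scoring
# stands in here for the PLDA backend described above.
test_feats = rng.standard_normal((400, 24))
test_emb = extract_embedding(weights, test_feats)
score = np.dot(enroll_emb, test_emb) / (
    np.linalg.norm(enroll_emb) * np.linalg.norm(test_emb))
print(f"cosine score: {score:.3f}")
```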

In DNN-based SV systems, it is essential to construct an efficient deep embedding learning architecture for discriminating between speakers. Such an architecture can be built with a time-delay neural network (TDNN) [7], [8], a convolutional neural network (CNN) [9], [10] or a long short-term memory network (LSTM) [11]. Among these methods, the x-vector [8], which uses a TDNN architecture and outperforms the traditional i-vector, has become the state-of-the-art DNN-based method in current SV work.

The x-vector architecture contains three typical parts: five TDNN layers for frame-level processing, a statistics pooling layer that aggregates the frame-level outputs across all frames, and two fully-connected (FC) layers for utterance-level processing. Although the x-vector shows great performance on SV, its TDNN layers limit further improvement. The works in [12], [13] show that exploiting information from multiple layers is beneficial for an effective representation, since there is a gradual transition from low to high layers in a DNN. However, concatenating the outputs of the five high-dimensional TDNN layers in the x-vector (512, 512, 512, 512, and 1500 dimensions, respectively) drastically increases computational complexity. In addition, the frequency dimension of the input acoustic features is typically only 20 to 30, leading to a huge difference in dimensionality between the input and the first TDNN layer and resulting in a loss of the corresponding feature information.
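
For concreteness, the x-vector topology described above can be sketched in PyTorch as follows. The layer widths (512, 512, 512, 512, 1500) follow the text; the kernel sizes and dilations are a common approximation of the TDNN frame contexts and may differ from the original recipe.

```python
import torch
import torch.nn as nn

class XVector(nn.Module):
    def __init__(self, feat_dim: int = 24, emb_dim: int = 512, n_speakers: int = 1000):
        super().__init__()
        # TDNN layers realized as 1-D convolutions over time with growing dilation.
        self.frame_layers = nn.Sequential(
            nn.Conv1d(feat_dim, 512, kernel_size=5, dilation=1), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=2), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=3), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=1), nn.ReLU(),
            nn.Conv1d(512, 1500, kernel_size=1), nn.ReLU(),
        )
        self.fc1 = nn.Linear(2 * 1500, emb_dim)    # embedding ("x-vector") layer
        self.fc2 = nn.Linear(emb_dim, emb_dim)
        self.out = nn.Linear(emb_dim, n_speakers)  # training-time classifier

    def forward(self, x):                # x: (batch, feat_dim, frames)
        h = self.frame_layers(x)
        # Statistics pooling: concatenate mean and std across all frames.
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)
        emb = self.fc1(stats)            # extracted as the speaker embedding
        return self.out(torch.relu(self.fc2(torch.relu(emb))))

logits = XVector()(torch.randn(8, 24, 300))  # 8 chunks of 300 frames each
```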

In recent years, to make networks focus on important features, attention mechanisms have gradually been introduced into DNN-based SV systems and applied at different levels of the DNN architecture, such as the pooling layer [14], [15], [16], [17], the frame level [1] and the FC layer [18]. However, those works employ the attention mechanism at only a single level of the DNN and emphasize important features along a single dimension, ignoring the impact of multiple levels and other feature dimensions.
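
As an example of pooling-level attention in the spirit of [14], [15], [16], [17], the sketch below scores each frame with a small network and uses the normalized scores to weight the mean and standard deviation; this is a generic illustration, not any cited paper's exact implementation.

```python
import torch
import torch.nn as nn

class AttentiveStatsPooling(nn.Module):
    def __init__(self, in_dim: int, att_dim: int = 128):
        super().__init__()
        # A small scoring network produces one attention logit per frame.
        self.attention = nn.Sequential(
            nn.Conv1d(in_dim, att_dim, kernel_size=1), nn.Tanh(),
            nn.Conv1d(att_dim, 1, kernel_size=1),
        )

    def forward(self, h):                            # h: (batch, channels, frames)
        w = torch.softmax(self.attention(h), dim=2)  # per-frame weights, sum to 1
        mean = (w * h).sum(dim=2)
        var = (w * h.pow(2)).sum(dim=2) - mean.pow(2)
        std = var.clamp(min=1e-8).sqrt()
        return torch.cat([mean, std], dim=1)         # (batch, 2 * channels)

pooled = AttentiveStatsPooling(1500)(torch.randn(8, 1500, 300))
```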

Based on the above observations, we propose an effective deep embedding learning architecture called Dilated Residual Networks with Multi-level Attention (DRES-MA for short), which combines dilated residual networks with a multi-level attention model for SV tasks. Inspired by the traditional x-vector, we utilize one-dimensional dilated convolutional layers to capture long time-frequency context information for frame-level features. To exploit information from multiple layers, residual connections [19] are employed to build the dilated residual networks. Besides, a multi-level attention model is proposed to focus on important features at multiple network levels, including the frame level and the pooling layer. At the frame level, we propose a two-dimensional convolutional block attention module (CBAM2D) to capture important features along the temporal and channel axes. At the pooling layer, we introduce a vector-based attention (VA) mechanism and incorporate it into a weighted statistics pooling layer, called the VA-based pooling layer. The CBAM2D and the VA-based pooling layer together form the multi-level attention model, applying attention mechanisms at multiple network levels. Experiments conducted on the NIST SRE 2016 dataset [20] verify the effectiveness of the proposed architecture. The DRES-MA architecture achieves an equal error rate (EER) of 7.094% and a detection cost function 2016 (DCF16) score of 0.552. Compared to the x-vector baseline, DRES-MA improves the EER by 17% and the DCF16 score by 9% with little increase in computational cost.
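
A minimal sketch of the kind of one-dimensional dilated residual block that DRES-MA stacks at the frame level is given below; the channel width, kernel size and normalization placement are assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class DilatedResBlock1D(nn.Module):
    def __init__(self, channels: int, dilation: int):
        super().__init__()
        pad = dilation  # keeps the frame length unchanged for kernel_size=3
        self.body = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=3, dilation=dilation, padding=pad),
            nn.BatchNorm1d(channels), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=3, dilation=dilation, padding=pad),
            nn.BatchNorm1d(channels),
        )

    def forward(self, x):               # x: (batch, channels, frames)
        # The residual connection lets later layers reuse earlier features,
        # while dilation widens the temporal context without extra parameters.
        return torch.relu(x + self.body(x))

# Stacking blocks with growing dilation expands the receptive field quickly.
net = nn.Sequential(*[DilatedResBlock1D(256, d) for d in (1, 2, 4, 8)])
out = net(torch.randn(8, 256, 300))
```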

The rest of this paper is organized as follows. Section 2 introduces related work on DNN-based SV methods. Section 3 describes the network construction of the proposed DRES-MA architecture. Section 4 presents the experimental setup. Experimental results and analysis of the proposed techniques are presented in Section 5. Section 6 concludes this work.

Section snippets

Related work

In recent years, with the development of deep learning and data augmentation techniques, DNN-based SV methods have become popular and achieved competitive performance compared with the traditional i-vector method. Previous works mainly improved the performance of DNN-based SV methods from three aspects: network architecture, loss function and attention mechanism.

For network architecture, the x-vector proposed by Snyder et al. [8] utilized a TDNN to achieve superior performance compared with the

Overview

In this section, we propose a new deep speaker embedding architecture called Dilated Residual Networks with Multi-level Attention (DRES-MA for short), shown in Fig. 2. The proposed architecture consists of three parts: frame-level part, pooling layer and utterance-level part. In the frame-level part, we utilize 16 dilated residual blocks combined with two-dimensional convolutional block attention modules (CBAM2D) to deal with the frame-level features. At the pooling layer, the vector-based
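
The CBAM2D module mentioned above, which attends along the channel and temporal axes of the frame-level feature maps, can be illustrated by the simplified CBAM-style stand-in below; the reduction ratio and kernel size are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class ChannelTemporalAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        # Shared MLP for channel attention (squeeze-and-excite style).
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels),
        )
        # 1-D convolution over per-frame statistics for temporal attention.
        self.temporal_conv = nn.Conv1d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):                         # x: (batch, channels, frames)
        # Channel attention from average- and max-pooled frame statistics.
        avg = self.channel_mlp(x.mean(dim=2))
        mx = self.channel_mlp(x.max(dim=2).values)
        x = x * torch.sigmoid(avg + mx).unsqueeze(2)
        # Temporal attention from per-frame average and max over channels.
        t = torch.cat([x.mean(dim=1, keepdim=True),
                       x.max(dim=1, keepdim=True).values], dim=1)
        return x * torch.sigmoid(self.temporal_conv(t))

refined = ChannelTemporalAttention(256)(torch.randn(8, 256, 300))
```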

Dataset

The training dataset consists of English telephone speech collected from NIST speaker recognition evaluations (SRE) [34] and the Switchboard datasets [35]. The test dataset is the NIST 2016 speaker recognition evaluation (SRE16) [20]. The SRE training data consist of NIST SRE data from 2005, 2006 [36] and 2008 [37], while the Switchboard datasets are made up of Switchboard 2 Phase 1, 2 and 3 and Switchboard Cellular, which together contain about 28k recordings from 2.6k speakers. In total, there are 4343

Overall results

The experimental results of the various systems are summarized in Table 3. As described in Section 4.1, the SRE16 test set covers two languages: Tagalog and Cantonese. Hence, the test set is divided into three parts: Tagalog, Cantonese and the combination of the two languages (All). Except for the EER on Tagalog, the DRES-MA system achieves better performance than the three baseline systems. In particular, it achieves the best EER of 7.094% and the best DCF16 score of 0.552 on the SRE16 All set.
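
For reference, the EER is the operating point at which the false-accept and false-reject rates are equal. The sketch below shows one way to compute it from trial scores by sweeping a decision threshold; the scores here are synthetic, for illustration only.

```python
import numpy as np

def equal_error_rate(scores: np.ndarray, labels: np.ndarray) -> float:
    """labels: 1 for target (same-speaker) trials, 0 for non-target trials."""
    order = np.argsort(scores)[::-1]          # sort trials by descending score
    labels = labels[order]
    # Cumulative false-accept / false-reject rates as the threshold sweeps down.
    fa = np.cumsum(1 - labels) / max((1 - labels).sum(), 1)
    fr = 1.0 - np.cumsum(labels) / max(labels.sum(), 1)
    idx = np.argmin(np.abs(fa - fr))          # point where the two rates cross
    return float((fa[idx] + fr[idx]) / 2)

rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(1.0, 1.0, 1000),    # target trials
                         rng.normal(-1.0, 1.0, 1000)])  # non-target trials
labels = np.concatenate([np.ones(1000), np.zeros(1000)])
print(f"EER: {100 * equal_error_rate(scores, labels):.2f}%")
```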

Conclusion

In this paper, an effective deep embedding architecture, DRES-MA, is proposed for text-independent speaker verification, which combines dilated residual networks with a multi-level attention model. The dilated residual networks not only capture long time-frequency context information of frame-level features, but also exploit information from multiple layers. In addition, the multi-level attention model can emphasize important features at multiple levels of the DNN architecture. Experimental results

CRediT authorship contribution statement

Yanfeng Wu: Conceptualization, Methodology, Software, Writing - original draft. Chenkai Guo: Conceptualization, Methodology, Writing - review & editing. Hongcan Gao: Writing - review & editing, Resources. Jing Xu: Supervision, Writing - review & editing. Guangdong Bai: Data curation.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

This work is supported by the Science and Technology Planning Project of Tianjin, China (Grant No. 18ZXZNGX00310), the Tianjin Natural Science Foundation (Grant Nos. 17JCZDJC30700 and 19JCQNJC00300), and the Fundamental Research Funds for the Central Universities of Nankai University (Grant No. 6319140).

References (43)

  • T. Bian et al., Self-attention based speaker recognition using cluster-range loss, Neurocomputing (2019)
  • S. Novoselov, A. Shulipa, I. Kremnev, A. Kozlov, V. Shchemelinin, On deep speaker embeddings for text-independent...
  • N. Dehak et al., Front-end factor analysis for speaker verification, IEEE Trans. Audio, Speech, Language Process. (2011)
  • S. Ioffe, Probabilistic linear discriminant analysis
  • E. Variani, X. Lei, E. McDermott, I.L. Moreno, J. Gonzalez-Dominguez, Deep neural networks for small footprint...
  • R. Acharya, H.A. Patil, H. Kotta, Novel enhanced teager energy based cepstral coefficients for replay spoof detection,...
  • D. Snyder, D. Garcia-Romero, D. Povey, S. Khudanpur, Deep neural network embeddings for text-independent speaker...
  • D. Snyder et al., X-vectors: Robust DNN embeddings for speaker recognition
  • A. Torfi, J. Dawson, N.M. Nasrabadi, Text-independent speaker verification using 3d convolutional neural networks, in:...
  • Z. Gao, Y. Song, I. McLoughlin, W. Guo, L. Dai, An improved deep embedding learning method for short duration speaker...
  • L. Wan et al., Generalized end-to-end loss for speaker verification
  • J. Yosinski et al., How transferable are features in deep neural networks?
  • Y. Jiang, Y. Song, I. McLoughlin, Z. Gao, L.-R. Dai, An effective deep embedding learning architecture for speaker...
  • K. Okabe, T. Koshinaka, K. Shinoda, Attentive statistics pooling for deep speaker embedding, in: Proc. Interspeech...
  • Y. Zhu, T. Ko, D. Snyder, B. Mak, D. Povey, Self-attentive speaker embeddings for text-independent speaker...
  • G. Bhattacharya, M.J. Alam, V. Gupta, P. Kenny, Deeply fused speaker embeddings for text-independent speaker...
  • Y. Liu, L. He, W. Liu, J. Liu, Exploring a unified attention-based pooling framework for speaker verification, in: 2018...
  • G. Bhattacharya, J. Alam, P. Kenny, Deep speaker embeddings for short-duration speaker verification, in: Proc....
  • K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: The IEEE Conference on Computer...
  • S.O. Sadjadi, T. Kheyrkhah, A. Tong, C. Greenberg, D. Reynolds, E. Singer, L. Mason, J. Hernandez-Cordero, The 2016...
  • Y. Tang et al., Deep speaker embedding learning with multi-level pooling for text-independent speaker verification


Yanfeng Wu is a Ph.D. candidate in the College of Artificial Intelligence of Nankai University. He received the B.E. degree from Nankai University in 2017. His research interests include speaker recognition and sound classification.

Chenkai Guo received the Ph.D. degree from Nankai University in 2017. He is an assistant professor in the College of Computer Science, Nankai University. His research interests include software analysis on mobile apps, information security, and intelligent software engineering.

Hongcan Gao is a Ph.D. candidate in the College of Computer Science of Nankai University. She received the M.S. degree from Hebei University of Technology in 2017. Her research interests include software analysis on mobile apps and software security.

Jing Xu received the Ph.D. degree from Nankai University in 2003. She is a professor in the College of Artificial Intelligence, Nankai University. She received the Second Prize of the Tianjin Science and Technology Progress Award twice, in 2017 and 2018. Currently, her research interests include intelligent software engineering and medical data analysis.

Guangdong Bai received the bachelor's and master's degrees in computing science from Peking University, China, in 2008 and 2011, respectively, and the Ph.D. degree in computing science from the National University of Singapore in 2015. He is now a Senior Lecturer at the University of Queensland. His research interests include cyber security, software engineering and machine learning.

1 Equal contribution.
