Fusion of bottleneck, spectral and modulation spectral features for improved speaker verification of neutral and whispered speech
Introduction
Voice-based biometric authentication, which combines mathematics and digital signal processing techniques to analyze speech characteristics associated with speaker identity, is one domain that has gained significant attention recently. Contrary to knowledge-based authentication (KBA), voice-based authentication is based on who the user is, rather than what the user knows. Such technologies are burgeoning for identity management as they eliminate the need for personal identification numbers (PINs), passwords, and security questions (Unar et al., 2014). Despite the several advantages of using speech for identification purposes, challenges and unresolved problems remain, hampering the widespread adoption of these technologies. For example, vocal effort variations between training and testing conditions (e.g., training on normally-phonated speech, testing on whispered speech) can have severe detrimental effects on speech-enabled applications such as speech and speaker recognition (Grimaldi and Cummins, 2008; Ito et al., 2005; Zelinka et al., 2012; Hanilci et al., 2013).
In the context of speaker verification (SV), important advances have been made to overcome this issue. Until recently, speaker verification systems were based on identity vectors (i-vectors) extracted from mel-frequency cepstral coefficient (MFCC) feature vectors, with probabilistic linear discriminant analysis (PLDA) based scoring (Dehak et al., 2011; Sizov et al., 2014). Lately, however, deep learning approaches have shown promising results by replacing (1) the classical MFCCs as acoustic features with the so-called bottleneck features (BNF) (Matejka et al., 2016) and/or (2) the Gaussian mixture model (GMM) approach used to compute the necessary statistics during i-vector extraction (Lei et al., 2014; Richardson et al., 2016). Such innovations have required large datasets covering thousands of speakers, several hours of recordings, a variety of channels, and different recording sessions. Such massive datasets exist for normally-phonated, neutral speech. Datasets of speech produced under other vocal efforts (e.g., shouted, whispered), in turn, are limited not only in number, but also in size. In such cases, additional compensation techniques are needed to allow for efficient use of limited non-neutral speech data without compromising the performance achieved with neutral speech. This paper addresses this important issue and explores the benefits of deep learning methods for such a low-resource task.
More specifically, special emphasis is placed here on whispered speech, a natural mode of speech production conveying relevant and useful information, including speaker identity (Lass et al., 1976; Tartter, 1991; Ito et al., 2005; Chenghui et al., 2009; Tsunoda et al., 2012). While different approaches have been proposed to make speaker recognition systems robust against environmental noise (Rao and Sarkar, 2014), limited work has been done to add robustness against varying vocal efforts, particularly for whispered speech (Grimaldi and Cummins, 2008; Fan and Hansen, 2013). Notwithstanding, some attempts have been reported in the literature to overcome the training/test mismatch problem in which speaker models are trained with neutral speech and tested with whispered speech (Fan and Hansen, 2011; Grimaldi and Cummins, 2008; Fan and Hansen, 2013; Sarria-Paja and Falk, 2015). Such attempts have included alternative feature representations, fusion at the frame and scoring levels, and feature mapping, to name a few (Paja and Falk, 2015). Findings from these works include: (i) existing features (e.g., MFCC) do not convey sufficiently reliable speaker identity information across vocal efforts, so new features carrying speaker-dependent information embedded within both speaking styles are needed; (ii) multi-style models, i.e., adding whispered speech to speaker models during training (or enrollment), have shown to be effective for whispered speech but introduce negative effects when testing with normal speech; and (iii) feature mapping and fusion schemes mitigate the mismatch problem, but also degrade neutral speech SV performance (Sarria-Paja and Falk, 2015; Sarria-Paja et al., 2015; Sarria-Paja et al., 2016). As such, in order to obtain reliable SV performance for both neutral and whispered speech, alternate feature representations and fusion schemes are still needed.
Existing feature extraction approaches for speaker recognition place emphasis on different aspects of the speech signal (e.g., temporal, spectral, phase), thus resulting in features that contain complementary information for the task at hand. This hypothesis has motivated researchers to explore fusion at different levels to combine the strengths and complementarity of the different features. Experimental results have shown that this strategy is effective at improving speaker verification performance in diverse scenarios (Doddington et al., 2000; Kinnunen et al., 2008; Boujelbene et al., 2011; Khoury et al., 2014). In previous work, we have shown that fusion is useful for normal and whispered speech speaker verification (Sarria-Paja et al., 2016; Sarria-Paja and Falk, 2017). In particular, fusion of auditory-inspired amplitude modulation spectral and cepstral features at the score level provided reliable results for both neutral and whispered speech, with reported relative improvements of 66% and 63% for whispered and neutral speech, respectively (Sarria-Paja and Falk, 2017). However, fusion can also be performed at the feature level, where different feature representations are concatenated to obtain an enriched representation. This can be particularly useful for data-driven approaches, such as DNNs. Here, feature- and score-level fusion strategies are explored, as are new feature representations, such as phonetically-aware bottleneck features (Lei et al., 2014; Richardson et al., 2016).
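The two fusion strategies can be contrasted with a minimal sketch. The function names, the z-normalization step, and the fixed fusion weight below are illustrative assumptions, not the calibration actually used in this work; in practice the weight would be tuned on a development set.

```python
import numpy as np

def score_level_fusion(scores_a, scores_b, alpha=0.5):
    """Weighted sum of per-trial scores from two SV subsystems.

    Scores are z-normalized first so the two systems share a common
    scale; alpha is an illustrative fusion weight.
    """
    za = (scores_a - scores_a.mean()) / scores_a.std()
    zb = (scores_b - scores_b.mean()) / scores_b.std()
    return alpha * za + (1.0 - alpha) * zb

def feature_level_fusion(ivec_a, ivec_b):
    """Concatenate i-vectors from two feature streams into one
    enriched representation (e.g., two 400-dim i-vectors -> 800-dim)."""
    return np.concatenate([ivec_a, ivec_b], axis=-1)
```

Score-level fusion keeps the subsystems independent and only combines their decisions, whereas feature-level (i-vector) concatenation hands the back-end classifier a single enriched representation.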
In this paper, we explore several innovations relative to our previous work (Sarria-Paja et al., 2016; Sarria-Paja and Falk, 2017). First, we explore the capabilities of DNNs to simulate the auditory system and extract invariant information across normal and whispered speech. We build on previously reported findings suggesting that, when extracting bottleneck features, the optimal position of the bottleneck layer depends on the task at hand and on how similar the data used to train the DNN are to the evaluation set (McLaren et al., 2016). Furthermore, perceptual studies have shown that human listeners can recognize speakers who are whispering, even without previous experience listening to this speech modality from the target speakers (Tartter, 1991). Hence, in the absence of whispered speech data during training, a bottleneck layer placed between the input and central layers is expected to yield features more robust to changes in vocal effort. Second, we explore the fusion of these bottleneck features with spectral and amplitude modulation features, thus taking advantage of their potential complementarity. We found that combining information from MFCC variants and standard MFCCs to extract bottleneck features reduced error rates for whispered speech in both matched and mismatched conditions, while maintaining high performance for neutral speech when using i-vector concatenation.
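The idea of an early bottleneck layer can be sketched as a plain feed-forward pass. The layer widths, random weights, and bottleneck index below are illustrative assumptions (the actual network topology and training procedure are described later in the paper); the point is only that the 60-dim bottleneck sits between the input and central layers rather than next to the output.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative layer widths: the 60-dim bottleneck is the second
# hidden layer, i.e., between the input and central layers, rather
# than next to the output as in standard BNF extractors.
widths = [440, 1024, 60, 1024, 1024, 1024]
weights = [rng.standard_normal((i, o)) * 0.01
           for i, o in zip(widths[:-1], widths[1:])]

def bottleneck_features(frames, bn_index=2):
    """Forward-propagate context-stacked frames and return the
    activations of the (early) bottleneck layer as per-frame features."""
    h = frames
    for k, w in enumerate(weights, start=1):
        h = np.maximum(h @ w, 0.0)   # ReLU hidden layers
        if k == bn_index:
            return h                 # 60-dim BNF per frame
    return h
```

In a trained network the remaining layers after the bottleneck would drive a phonetic classification target; at feature-extraction time they are simply discarded.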
The remainder of this paper is organized as follows. Section 2 provides background on the speaker recognition problem and summarizes the feature extraction techniques. Section 3 describes the corpus employed for speaker verification and the fusion approaches used in this work. Section 4 presents the results, discussion, and analysis of our experiments and the performance achieved by the proposed schemes. Lastly, Section 7 presents the final conclusions.
Section snippets
i-vectors/PLDA approach
The i-vector extraction technique is currently the standard approach to map the classical frame-based representation of variable-length speech recordings to a fixed-length, low-dimensional feature vector while retaining the most relevant speaker information. This approach relies on a C-component Gaussian mixture model (GMM) trained as a universal background model (UBM) to partition the feature space and collect sufficient statistics, which are in turn used to extract i-vectors from
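The sufficient-statistics step described above can be sketched in a few lines. This is a minimal numpy illustration under the assumption of a diagonal-covariance UBM; the function name and shapes are not taken from the paper. The i-vector extractor then maps the per-recording statistics (N, F) to a single low-dimensional vector.

```python
import numpy as np

def baum_welch_stats(X, weights, means, variances):
    """Zeroth- and first-order sufficient statistics of frames X (T x D)
    under a diagonal-covariance C-component GMM/UBM.
    Returns N (C,) and F (C, D)."""
    # log-likelihood log[w_c * N(x_t | mu_c, sigma_c)] for each frame/component
    ll = -0.5 * (np.log(2 * np.pi * variances).sum(axis=1)
                 + (((X[:, None, :] - means) ** 2) / variances).sum(axis=2))
    ll += np.log(weights)
    # frame-level component posteriors gamma (T x C), computed stably
    gamma = np.exp(ll - ll.max(axis=1, keepdims=True))
    gamma /= gamma.sum(axis=1, keepdims=True)
    N = gamma.sum(axis=0)    # zeroth-order statistics
    F = gamma.T @ X          # first-order statistics
    return N, F
```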
Speech datasets
For the experiments herein, an evaluation corpus combining several sources of neutral and whispered speech is used. Three different databases were pooled together: the CHAINS (Characterizing Individual Speakers) speech corpus (Cummins et al., 2006), wTIMIT (whispered TIMIT) (Lim, 2011), and TIMIT (Garofolo and Consortium, 1993). The CHAINS and wTIMIT databases contain both neutral and whispered speech; TIMIT contains only neutral speech. Table 1 presents details about the number of speakers and
Baseline system characterization
Table 3 reports the SV results achieved for both neutral and whispered speech under two different scenarios. In Scenario 1, whispered speech was added during T-matrix parameter estimation. This addition made it possible to include some of the variability present in whispered speech features and improved system performance when testing with this speech modality. In Scenario 2, the same T-matrix from Scenario 1 was used, but whispered speech recordings were also included in both training
Multi-style models
In the following experiments we further explore Scenario 2 (as shown in Fig. 3(b)). In this scenario we evaluated the addition of small amounts of whispered speech from target speakers within the two fusion schemes. Two combined feature sets were used in this experiment: (i) S3, corresponding to the fusion of FBBNF and AAMF feature sets, as motivated by whispered speech results in Tables 4 and 5, and (ii) S4, corresponding to LRBNF3, LMFCC, RMFCC, and AAMF(FS), which after a pilot experiment
Discussion
While characterizing the baseline systems (Table 3) for neutral speech, it became evident that MFCC and standard BNF features achieved EER figures higher than those typically reported in the literature (Sadjadi et al., 2013; Richardson et al., 2016). This is likely due to the short speech duration, which limits the phonetic variability present in the training set (Vogt et al., 2008; Kanagasundaram, Vogt,
Conclusions
Herein, we have addressed the problem of speaker verification (SV) based on whispered speech. Different i-vector/PLDA based systems, trained with deep bottleneck neural network features and different short-time feature sets, were compared. Two fusion schemes (score-level and i-vector-level) were implemented to overcome existing challenges observed for the task at hand, including: (i) short-duration utterances (4.5 s on average), (ii) no whispered speech data during enrollment from target
Acknowledgments
The authors acknowledge funding from the Natural Sciences and Engineering Research Council of Canada (NSERC) and the Administrative Department of Science, Technology and Innovation of Colombia (COLCIENCIAS).
References (48)
- et al., "The NIST speaker recognition evaluation - overview, methodology, systems, results, perspective," Speech Commun. (2000)
- et al., "Acoustic analysis and feature transformation from neutral to whisper for speaker identification within whispered speech audio streams," Speech Commun. (2013)
- et al., "Analysis and recognition of whispered speech," Speech Commun. (2005)
- et al., "Acoustic analysis of consonants in whispered speech," J. Voice (2008)
- et al., "An overview of text-independent speaker recognition: from features to supervectors," Speech Commun. (2010)
- et al., "The effects of whispered speech on state-of-the-art voice based biometrics systems," Proc. IEEE CCECE (2015)
- et al., Automatic Speech Recognition: A Deep Learning Approach (2014)
- et al., "A statistical significance test for person authentication," Proc. Odyssey 2004: The Speaker and Language Recognition Workshop, No. EPFL-CONF-83049 (2004)
- et al., "General machine learning classifiers and data fusion schemes for efficient speaker recognition," Int. J. Comput. Sci. Emerg. Technol. (2011)
- et al., The BOSARIS Toolkit User Guide: Theory, Algorithms and Code for Binary Classifier Score Processing, Technical Report (2011)