Speech Communication

Volume 102, September 2018, Pages 78-86

Fusion of bottleneck, spectral and modulation spectral features for improved speaker verification of neutral and whispered speech

https://doi.org/10.1016/j.specom.2018.07.005

Highlights

  • We have addressed the problem of speaker verification (SV) based on whispered speech.

  • We explore the use of bottleneck features in the context of speaker verification using whispered speech.

  • Two fusion schemes were implemented to overcome existing challenges observed for the task at hand.

  • Challenge 1: Short-duration utterances (4.5 s on average) and no whispered speech data from target speakers during enrollment.

  • Challenge 2: The negative effects observed when adding whispered speech recordings during enrollment.

  • Dedicated systems per vocal effort offer a promising solution. A neutral/whispered speech classification system was implemented.

Abstract

Speech-based biometrics is becoming a preferred method of identity management amongst users and companies. Current state-of-the-art speaker verification (SV) systems, however, are known to be strongly dependent on the condition of the speech material provided as input and can be affected by unexpected variability presented during testing, such as environmental noise or changes in vocal effort. In this paper, SV using whispered speech is explored, as whispered speech is known to be a natural speaking style with reduced perceptibility but containing relevant information regarding speaker identity and gender. We propose to fuse information from spectral, modulation spectral and so-called bottleneck features computed via deep neural networks at the feature and score levels. Bottleneck features have recently been shown to provide robustness against train/test mismatch conditions and have yet to be tested for whispered speech. Experimental results showed that relative improvements as high as 79% and 60% could be achieved for neutral and whispered speech, respectively, relative to a baseline system trained with i-vectors extracted from mel-frequency cepstral coefficients. Results from our fusion experiments show that the proposed strategies allow efficient use of the limited resources available and result in whispered speech performance in line with that obtained with normal speech.

Introduction

Voice-based biometric authentication, which combines mathematics and digital signal processing techniques to analyze speech characteristics associated with speaker identity, is one domain that has gained significant attention recently. Contrary to knowledge-based authentication (KBA), voice-based authentication is based on who the user is, instead of what the user knows. Such technologies are burgeoning for identity management as they eliminate the need for personal identification numbers (PINs), passwords, and security questions (Unar et al., 2014). Despite the several advantages of using speech for identification purposes, challenges and unresolved problems still remain, thus hampering the widespread usage of these technologies. For example, vocal effort variations between training and testing conditions (e.g., training on normally-phonated speech, testing on whispered speech) can have severe detrimental effects on speech-enabled applications such as speech and speaker recognition (Grimaldi and Cummins, 2008; Ito et al., 2005; Zelinka et al., 2012; Hanilci et al., 2013).

In the context of speaker verification (SV), important advances have been made to overcome this issue. Until recently, speaker verification systems were based on identity vectors (i-vectors) extracted from mel-frequency cepstral coefficient (MFCC) feature vectors with probabilistic linear discriminant analysis (PLDA) based scoring (Dehak et al., 2011; Sizov et al., 2014). Lately, however, deep learning approaches have shown promising results by replacing (1) the classical MFCCs as acoustic features with so-called bottleneck features (BNF) (Matejka et al., 2016) and/or (2) the Gaussian mixture model (GMM) approach used to compute the necessary statistics during i-vector extraction (Lei et al., 2014; Richardson et al., 2016). Such innovations have required large datasets covering thousands of speakers, several hours of recordings, a variety of channels and different recording sessions. Such massive datasets exist for normally-phonated, neutral speech. In turn, corpora of speech produced under other vocal efforts (e.g., shouted, whispered) are limited not only in number, but also in size. In such cases, additional compensation techniques are needed to allow for efficient use of limited non-neutral speech data without compromising the performance achieved with neutral speech. This paper addresses this important issue and explores the benefits of deep learning methods for such a low-resource task.
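
As a concrete illustration of the PLDA scoring back end mentioned above, the sketch below computes a two-covariance PLDA log-likelihood ratio between an enrollment and a test i-vector. This is a minimal sketch of one common formulation, not the exact back end of this work: the between- and within-speaker covariances B and W are assumed to have been estimated beforehand on a labelled development set, and the values at the bottom are random placeholders.

```python
import numpy as np
from scipy.stats import multivariate_normal

def plda_llr(w_enroll, w_test, B, W):
    """Two-covariance PLDA log-likelihood ratio for a pair of mean-centred i-vectors.

    B: between-speaker covariance, W: within-speaker covariance (assumed to be
    estimated on a labelled development set).
    """
    pair = np.concatenate([w_enroll, w_test])
    tot = B + W
    cov_same = np.block([[tot, B], [B, tot]])            # same-speaker hypothesis
    cov_diff = np.block([[tot, np.zeros_like(B)],
                         [np.zeros_like(B), tot]])       # different-speaker hypothesis
    zero = np.zeros(pair.size)
    return (multivariate_normal.logpdf(pair, mean=zero, cov=cov_same)
            - multivariate_normal.logpdf(pair, mean=zero, cov=cov_diff))

# Toy usage with random placeholder covariances and i-vectors.
rng = np.random.default_rng(0)
d = 4
A = rng.normal(size=(d, d))
C = rng.normal(size=(d, d))
B, W = A @ A.T + d * np.eye(d), C @ C.T + d * np.eye(d)
score = plda_llr(rng.normal(size=d), rng.normal(size=d), B, W)
```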

More specifically, here special emphasis is placed on whispered speech, a natural mode of speech production conveying relevant and useful information, including speaker identity (Lass et al., 1976; Tartter, 1991; Ito et al., 2005; Chenghui et al., 2009; Tsunoda et al., 2012). While different approaches have been proposed to add robustness to speaker recognition systems against environmental noise (Rao and Sarkar, 2014), limited work has been done to add robustness against varying vocal efforts, particularly for whispered speech (Grimaldi and Cummins, 2008; Fan and Hansen, 2013). Notwithstanding, some attempts have been reported in the literature to overcome the train/test mismatch problem where speaker models were trained with neutral speech and tested with whispered speech (Fan and Hansen, 2011; Grimaldi and Cummins, 2008; Fan and Hansen, 2013; Sarria-Paja and Falk, 2015). Such attempts have included alternative feature representations, fusion at the frame and scoring levels, and feature mapping, to name a few (Paja and Falk, 2015). Findings from these works included: (i) existing features (e.g., MFCC) did not convey sufficiently reliable speaker identity information across vocal efforts, thus new features carrying speaker-dependent information embedded within both speaking styles were shown to be needed; (ii) the use of multi-style models, i.e., adding whispered speech to speaker models during training (or enrollment), has been shown to be effective for whispered speech but adds some negative effects when testing with normal speech; and (iii) the use of feature mapping and fusion schemes alleviated the mismatch problem, but also affected the performance of neutral speech SV (Sarria-Paja and Falk, 2015; Sarria-Paja et al., 2015; Sarria-Paja et al., 2016). As such, in order to obtain reliable SV performance for neutral and whispered speech, alternate feature representations and fusion schemes are still needed.

Existing feature extraction approaches for speaker recognition place emphasis on different aspects of the speech signal (e.g., temporal, spectral, phase), thus resulting in features that contain complementary information for the task at hand. This hypothesis has motivated researchers to explore fusion at different levels to combine the strengths and complementarity of the different features. Experimental results have shown that this strategy is effective at improving speaker verification performance in diverse scenarios (Doddington et al., 2000; Kinnunen et al., 2008; Boujelbene et al., 2011; Khoury et al., 2014). In previous work, we have shown that fusion is useful for normal and whispered speech speaker verification (Sarria-Paja et al., 2016; Sarria-Paja and Falk, 2017). In particular, fusion of auditory-inspired amplitude modulation spectrum and cepstral features at the score level provided reliable results for both neutral and whispered speech (Sarria-Paja and Falk, 2017), with reported relative improvements of 66% and 63% for whispered and neutral speech, respectively. However, fusion can also be performed at the feature level, where different feature representations are concatenated in order to obtain an enriched representation. This can be particularly useful for data-driven approaches, such as DNNs. Here, feature- and score-level fusion strategies are explored, as are new feature representations, such as the phonetically-aware bottleneck features (Lei et al., 2014; Richardson et al., 2016).
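
To make the two fusion levels concrete, the short sketch below shows feature-level fusion as i-vector concatenation and score-level fusion as a weighted sum of per-system scores. The dimensions and the 0.7/0.3 weights are purely illustrative; in practice the fusion weights are tuned on a held-out development set (e.g., via logistic regression).

```python
import numpy as np

# Feature-level fusion: concatenate per-utterance i-vectors from different front ends
# (e.g., a BNF-based i-vector and an MFCC-based i-vector) into one enriched vector.
def concatenate_ivectors(ivec_a, ivec_b):
    return np.concatenate([ivec_a, ivec_b], axis=-1)

# Score-level fusion: weighted linear combination of (calibrated) per-system scores.
def fuse_scores(system_scores, weights):
    system_scores = np.asarray(system_scores, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return float(weights @ system_scores)

# Illustrative usage with made-up numbers.
fused_ivector = concatenate_ivectors(np.zeros(400), np.zeros(400))  # 800-dim i-vector
fused_score = fuse_scores([1.8, 0.6], weights=[0.7, 0.3])
```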

In this paper, we explore several innovations relative to our previous work (Sarria-Paja et al., 2016; Sarria-Paja and Falk, 2017). First, we explore the capabilities of DNNs to simulate the auditory system and extract invariant information across normal and whispered speech. We build on previously reported findings suggesting that, when extracting bottleneck features, the optimal position of the bottleneck layer depends on the task at hand and on how similar the data used for training the DNN is to the evaluation set (McLaren et al., 2016). Furthermore, perceptual studies have shown that human listeners can recognize speakers who are whispering, even if they have no previous experience listening to this speech modality from the target speakers (Tartter, 1991). Hence, in the absence of whispered speech data during training, it is expected that a bottleneck layer placed between the input and central layers should result in features more robust to changes in vocal effort. Second, we explore the fusion of these bottleneck features with spectral and amplitude modulation features, thus taking advantage of their potential complementarity. We found that combining information from MFCC variants and standard MFCCs to extract bottleneck features helped reduce error rates for whispered speech in both matched and mismatched conditions while maintaining high performance for neutral speech when using i-vector concatenation.
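
The sketch below illustrates the general bottleneck-feature idea described above: a DNN trained on frame-level phonetic (senone) targets with a narrow linear layer placed close to the input, whose activations are then used as features. The toolkit (Keras), layer sizes, senone count and bottleneck placement are assumptions for illustration only and do not reproduce the exact network configuration of this work.

```python
import numpy as np
from tensorflow.keras import layers, Model

# Toy dimensions: stacked cepstral frames at the input, senone posteriors at the output.
input_dim, n_senones, bn_dim = 60 * 11, 2000, 60

inputs = layers.Input(shape=(input_dim,))
x = layers.Dense(1500, activation="sigmoid")(inputs)
# Bottleneck placed close to the input layer, following the idea that this position
# yields features that generalize better when training and evaluation data differ.
bottleneck = layers.Dense(bn_dim, activation="linear", name="bottleneck")(x)
x = layers.Dense(1500, activation="sigmoid")(bottleneck)
x = layers.Dense(1500, activation="sigmoid")(x)
outputs = layers.Dense(n_senones, activation="softmax")(x)

dnn = Model(inputs, outputs)
dnn.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# dnn.fit(stacked_frames, senone_labels, ...)  # trained on frame-level phonetic targets

# Once trained, bottleneck activations serve as features for the i-vector front end.
bnf_extractor = Model(dnn.input, dnn.get_layer("bottleneck").output)
frames = np.random.randn(100, input_dim).astype("float32")      # placeholder frames
bottleneck_features = bnf_extractor.predict(frames, verbose=0)  # shape (100, bn_dim)
```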

The remainder of this paper is organized as follows. Section 2 provides background on the speaker verification problem and summarizes the feature extraction techniques. Section 3 describes the corpus employed for speaker verification and the fusion approaches used in this work. Section 4 presents the results, discussion and analysis of our experiments and the performance achieved by the proposed schemes. Lastly, Section 7 presents the final conclusions.

Section snippets

i-vectors/PLDA approach

The i-vector extraction technique is currently the standard approach to map the classical frame-based representation of variable-length speech recordings to a fixed-length, low-dimensional feature vector while retaining the most relevant speaker information. This approach relies on the use of a C-component Gaussian mixture model (GMM) trained as a universal background model (UBM) to partition the feature space and collect sufficient statistics, which are in turn used to extract i-vectors from
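
As a rough illustration of this front end, the sketch below fits a small diagonal-covariance UBM with scikit-learn, collects the zeroth- and first-order Baum-Welch statistics of an utterance, and maps them to an i-vector given a total-variability matrix T. Dimensions and data are toy placeholders, and T is drawn at random for illustration; in a real system T is itself trained with an EM procedure on the background statistics.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# UBM: C-component diagonal-covariance GMM fitted on background data (toy sizes).
C, D, R = 64, 20, 100                      # components, feature dim, i-vector dim
background = rng.normal(size=(5000, D))    # stand-in for pooled background MFCC frames
ubm = GaussianMixture(n_components=C, covariance_type="diag",
                      max_iter=20, random_state=0).fit(background)

def sufficient_stats(frames, ubm):
    """Zeroth- and first-order Baum-Welch statistics, centred on the UBM means."""
    gamma = ubm.predict_proba(frames)               # (N, C) component posteriors
    N = gamma.sum(axis=0)                           # (C,)  zeroth-order stats
    F = gamma.T @ frames - N[:, None] * ubm.means_  # (C, D) centred first-order stats
    return N, F

def extract_ivector(frames, ubm, T):
    """Posterior mean: w = (I + T' Sigma^-1 N T)^-1 T' Sigma^-1 F~."""
    C, D = ubm.means_.shape
    N, F = sufficient_stats(frames, ubm)
    sigma_inv = 1.0 / ubm.covariances_.reshape(-1)     # (C*D,) diagonal precision
    NN = np.repeat(N, D)                               # (C*D,) per-dimension counts
    TtSi = T.T * sigma_inv                             # (R, C*D)
    L = np.eye(T.shape[1]) + TtSi @ (NN[:, None] * T)  # (R, R) posterior precision
    return np.linalg.solve(L, TtSi @ F.reshape(-1))    # (R,) i-vector

# Hypothetical total-variability matrix and a stand-in utterance.
T = rng.normal(scale=0.1, size=(C * D, R))
utterance = rng.normal(size=(450, D))
w = extract_ivector(utterance, ubm, T)
```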

Speech datasets

For the experiments herein, an evaluation corpus pooling several sources of neutral and whispered speech is used. Three different databases were combined: the CHAINS (Characterizing Individual Speakers) speech corpus (Cummins et al., 2006), wTIMIT (whispered TIMIT) (Lim, 2011) and TIMIT (Garofolo and Consortium, 1993). The CHAINS and wTIMIT databases contain neutral and whispered speech; TIMIT contains only neutral speech. Table 1 presents details about the number of speakers and

Baseline system characterization

Table 3 reports the SV results achieved for both neutral and whispered speech under two different scenarios, namely: Scenario 1) whispered speech was added during T-matrix parameter estimation. This addition allowed some of the variability present in whispered speech features to be included and improved the performance of the system when testing with this speech modality. Scenario 2) the same T-matrix from Scenario 1 was used, but in this case whispered speech recordings were included in both training

Multi-Style models

In the following experiments we further explore Scenario 2 (as shown in Fig. 3(b)). In this scenario we evaluated the addition of small amounts of whispered speech from target speakers within the two fusion schemes. Two combined feature sets were used in this experiment: (i) S3, corresponding to the fusion of FBBNF and AAMF feature sets, as motivated by whispered speech results in Tables 4 and 5, and (ii) S4, corresponding to LRBNF3, LMFCC, RMFCC, and AAMF(FS), which after a pilot experiment

Discussion

While characterizing the baseline systems (Table 3) for neutral speech, it became evident that MFCC and standard BNF features achieved EER figures higher than what is typically reported in the literature (Sadjadi et al., 2013; Richardson et al., 2016). This is likely due to the short speech duration, which limits the phonetic variability present in the training set (Vogt, Lustri, Sridharan, 2008, Kanagasundaram, Vogt,

Conclusions

Herein, we have addressed the problem of speaker verification (SV) based on whispered speech. Different i-vector/PLDA based systems trained with deep bottleneck neural network features and different short-time feature sets were compared. Two fusion schemes (score-level and i-vector-level) were implemented to overcome existing challenges observed for the task at hand, including: (i) short-duration utterances (4.5 s on average), (ii) no whispered speech data during enrollment from target

Acknowledgments

The authors acknowledge funding from the Natural Sciences and Engineering Research Council of Canada (NSERC) and the Administrative Department of Science, Technology and Innovation of Colombia (COLCIENCIAS).

References (48)

  • G. Chenghui et al., A preliminary study on emotions of Chinese whispered speech, International Forum on Computer Science-Technology and Applications, Vol. 2, 2009.

  • F. Cummins et al., The CHAINS corpus: characterizing individual speakers, Proc. of SPECOM, Vol. 6, 2006.

  • N. Dehak et al., Front-end factor analysis for speaker verification, IEEE Audio Speech Lang. Process., 2011.

  • T. Falk et al., Modulation spectral features for robust far-field speaker identification, IEEE Trans. Audio Speech Lang. Process., 2010.

  • X. Fan et al., Speaker identification within whispered speech audio streams, IEEE Trans. Audio Speech Lang. Process., 2011.

  • S. Ganapathy et al., Static and dynamic modulation spectrum for speech recognition, Proc. INTERSPEECH, 2009.

  • J.S. Garofolo et al., TIMIT: Acoustic-phonetic continuous speech corpus, 1993.

  • M. Grimaldi et al., Speaker identification using instantaneous frequencies, IEEE Trans. Audio Speech Lang. Process., 2008.

  • C. Hanilci et al., Speaker identification from shouted speech: analysis and compensation, Proc. ICASSP, 2013.

  • M.Y. Hwang et al., Shared-distribution hidden Markov models for speech recognition, IEEE Trans. Speech Audio Process., 1993.

  • S. Irtza et al., Scalable i-vector concatenation for PLDA based language identification system, Proc. APSIPA, 2015.

  • A. Kanagasundaram et al., i-Vector based speaker recognition on short utterances, Proc. INTERSPEECH, 2011.

  • E. Khoury et al., Spear: an open source toolbox for speaker recognition based on Bob, Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2014.

  • T. Kinnunen et al., Dimension reduction of the modulation spectrogram for speaker verification, Proc. The Speaker and Language Recognition Workshop (Odyssey 2008), 2008.