
Computer Speech & Language

Volume 45, September 2017, Pages 437-456

Fusion of auditory inspired amplitude modulation spectrum and cepstral features for whispered and normal speech speaker verification

https://doi.org/10.1016/j.csl.2017.04.004

Highlights

  • Speaker verification based on whispered speech while keeping performance for normal speech.

  • Three innovative features carrying invariant information for both speaking styles.

  • A score fusion scheme to show the complementarity of the proposed feature sets.

  • A new approach to extract discriminative features from modulation spectrum signal representation.

Abstract

Whispered speech is a natural speaking style that, despite its reduced perceptibility, still contains relevant information regarding the intended message (i.e., intelligibility), as well as the speaker identity and gender. Given the acoustic differences between whispered and normally-phonated speech, however, speech applications trained on the latter but tested with the former exhibit unacceptable performance levels. Within an automated speaker verification task, previous research has shown that i) conventional features (e.g., mel-frequency cepstral coefficients, MFCCs) do not convey sufficient speaker discrimination cues across the two vocal efforts, and ii) multi-condition training, while improving the performance for whispered speech, tends to deteriorate the performance for normal speech. In this paper, we aim to tackle both shortcomings by proposing three innovative features, which, when fused at the score level, are shown to yield reliable performance for both normal and whispered speech. Overall, relative improvements of 66% and 63% are obtained for whispered and normal speech, respectively, over a baseline system based on MFCCs and multi-condition training.

Introduction

According to recent statistics, speech-based biometrics have ranked highly in customer preference, outranking fingerprint and iris scanning solutions (O’Neil King, 2014, Markets, 2015). Due to the widespread usage of smartphones worldwide, speech-based biometrics are quickly gaining popularity, particularly in financial institutions (O’Neil King, 2014). Within such applications, customers can gain access to their secure banking and insurance services by simply speaking into their phones. For financial institutions, this ease-of-use enhances customer satisfaction, whilst reducing customer care costs through increased automation rates. Customers, on the other hand, given the flexibility of speech-based communication, can pose significant challenges to such applications by changing, for example, their vocal effort based on the environment or the context they are in. This, together with ambient noise, has posed serious threats to the performance of speech-enabled applications in general. Ambient noise has detrimental effects on speech-based biometric systems, particularly those trained with mel-frequency cepstral coefficients (MFCCs). As an example, speaker identification accuracy as low as 7% has been reported in very noisy environments (Ming et al., 2007). As such, over the years, several speech enhancement algorithms have been proposed for environment-robust speaker recognition applications (Rao and Sarkar, 2014). Varying vocal effort, however, has received significantly less attention, despite its severe detrimental effects on speaker verification performance. For example, whispered-speech speaker identification accuracy as low as 20% has been reported (Grimaldi and Cummins, 2008) in clean conditions. In fact, it is highly likely that customers utilizing a mobile banking application on their smartphones will use a low vocal effort when providing sensitive information, and for the purposes of this research we are interested in whispered speech.

Here, special emphasis is placed on whispered speech because this speaking style has lately gained considerable attention for security applications. Despite its reduced perceptibility, whispered speech is a natural mode of speech production that conveys relevant and useful information for many applications. Like normal-voiced speech, whispered speech conveys not only a message, but also traits such as identity, gender, and emotional and health states, to name a few (Lass, Waters, Tyson, 1976, Tartter, 1991, Ito, Takeda, Itakura, 2005, Chenghui, Heming, Wei, Yanlei, Min, 2009, Tsunoda, Sekimoto, Baer, 2012). As previously mentioned, whispered speech is commonly used in public situations where private or discreet information needs to be exchanged, for example, when providing a credit card number, bank account number, or other personal information. Despite the amount of information present in whispered speech, certain characteristics make this speaking style challenging as an input to speech-enabled applications. The most salient characteristic of whispered speech is the lack of vocal fold vibration. Furthermore, when a person whispers, several changes occur in the vocal tract configuration, thus altering not only the excitation source, but also the syllabic rate and the general temporal dynamics of the generated speech signal (Jovicic, Saric, 2008, Ito, Takeda, Itakura, 2005). Hence, it is expected that classical methods designed for normal-voiced speech characterization will fail when tested in atypical scenarios involving whispered speech (Grimaldi, Cummins, 2008, Ito, Takeda, Itakura, 2005, Fan, Hansen, 2011, Zelinka, Sigmund, Schimmel, 2012).

Despite the limited research in this field, different approaches attempting to overcome some of these disadvantages have been reported, particularly within training/test mismatch conditions where speaker models were trained with normal speech and tested with whispered speech (Fan, Hansen, 2011, Grimaldi, Cummins, 2008, Fan, Hansen, 2013). In previous work, we found that among the different feature sets and strategies evaluated, including frequency warping and alternate feature representations such as MHEC (mean Hilbert envelope coefficients) or WIF (weighted instantaneous frequencies), the invariant information between normal-voiced and whispered speech is not sufficient to achieve reliable performance in speaker verification tasks for both speaking styles (Sarria-Paja and Falk, 2015). In addition, it was observed that the strategies with better performance for normal-voiced speech did not exhibit the same benefits for whispered speech. Finally, the need to include data from both speaking styles was demonstrated in order for a speaker verification system to handle both normal and whispered speech in practical applications (Sarria-Paja and Falk, 2015). It was concluded that efforts should be directed towards new feature representations aimed at reducing the impact of adding whispered speech during training or enrollment, and at extracting speaker-specific information from whispered recordings more efficiently.

This paper proposes just that: features designed to extract invariant information embedded within both speaking styles. This is achieved by computing modulation spectrum based features, which in the past have been shown to accurately separate speech from environment-based components (e.g., noise and reverberation) (Falk and Chan, 2010), thus adding robustness to speaker recognition systems. We also use mutual information (MI) as an analysis measure to identify invariant information between normal-voiced and whispered speech feature pairs. MI captures both linear and non-linear statistical dependencies between the two feature sets, and has been shown to be an effective way to measure relevance and redundancy among features for feature selection or characterization purposes (Peng, Long, Ding, 2005, Estevez, Tesmer, Perez, Zurada, 2009, Clerico, Gupta, Falk, 2015). This, combined with system fusion, helps not only to reduce error rates when no whispered speech recordings from target speakers are available for enrollment, but also to reduce the observed negative impact of adding whispered speech during parameter estimation and enrollment. As an example, in speech recognition, gains in whispered speech accuracy were countered by losses in normal speech accuracy, often by the same amount (Zelinka et al., 2012). Such tradeoffs were attributed to excessive generality of the speech model (caused by large variations in the training set) and a consequently reduced capability of discriminating among speech units (Zelinka et al., 2012). The new proposed system overcomes this limitation for a speaker verification task.
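As a rough illustration of the kind of MI analysis described above (not the authors' exact estimator), the following Python sketch estimates the mutual information between two scalar feature streams, e.g., the same cepstral coefficient extracted from normal and whispered renditions of an utterance, using a simple joint-histogram approximation; the function name and bin count are illustrative assumptions.

```python
import numpy as np

def mutual_information(x, y, bins=32):
    """Histogram-based MI estimate (in bits) between two scalar feature
    streams x and y (e.g., normal vs. whispered versions of a feature)."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = joint / joint.sum()                 # joint probabilities
    p_x = p_xy.sum(axis=1, keepdims=True)      # marginal of x (column vector)
    p_y = p_xy.sum(axis=0, keepdims=True)      # marginal of y (row vector)
    nz = p_xy > 0                              # sum only non-zero cells (avoid log(0))
    return float(np.sum(p_xy[nz] * np.log2(p_xy[nz] / (p_x @ p_y)[nz])))
```

A high MI value between the normal- and whispered-speech versions of a feature suggests that the feature carries information that is invariant across the two speaking styles.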

The remainder of this paper is organized as follows. Section 2 provides a brief background on whispered speech, emphasizing the main differences with normal speech and reviewing related approaches found in the literature. Section 3 describes the speaker verification problem, the corpus employed, and the baseline system characterization. Section 4 discusses different approaches and strategies to reduce the error rate in whispered speech speaker verification, and presents and discusses the experimental results and the performance achieved by the proposed schemes. Lastly, Section 5 presents the conclusions.

Section snippets

Whispered speech

In the past, perceptual studies have been conducted to characterize the major acoustic differences between whispered and normal-voiced speech, with the lack of fundamental frequency being the most relevant difference. Nonetheless, it is not the only major change: formant shifts towards higher frequencies (Thomas, 1969, Higashikawa, Nakai, Sakakura, Takahashi, 1996), especially for the lower formants (Sharifzadeh et al., 2012), have also been reported. Whispered speech also has a lower and flatter power

Automatic speaker verification (SV) system characterization

Traditionally, speaker recognition systems have been based on identity vector (i-vector) extraction (Dehak et al., 2011), and matching between a test utterance and a target speaker is done using either a fast scoring method based on the cosine distance between i-vectors or probabilistic linear discriminant analysis (PLDA) (Sizov et al., 2014) based scoring. More recently, deep neural network (DNN) approaches have been shown to be useful either during feature extraction or computing statistics by
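As a minimal sketch of the cosine-distance scoring step mentioned above (not the full i-vector/PLDA pipeline, which typically also applies length normalization and channel compensation such as LDA/WCCN before scoring), the following assumes two already-extracted i-vectors:

```python
import numpy as np

def cosine_score(w_test, w_target):
    """Cosine similarity between a test i-vector and a target-speaker
    i-vector; higher values indicate a better match."""
    return float(np.dot(w_test, w_target) /
                 (np.linalg.norm(w_test) * np.linalg.norm(w_target)))

# The verification decision compares the score against a tuned threshold:
# accept = cosine_score(w_test, w_target) >= threshold
```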

Proposed strategies to improve system performance when testing with whispered speech

Results presented in the previous section have shown that standard SV systems have serious deficiencies when facing atypical scenarios. As such, for the task at hand, it is necessary to devise strategies to compensate for the negative effects that arise when whispered speech is included among the possible testing scenarios. The approach taken in this work is to improve the feature representation used for i-vector extraction. From the baseline experiments it is clear that the total variability
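To illustrate the general idea behind modulation spectrum based feature representations (a simplified sketch, not the exact AAMF computation of the paper, which relies on an auditory-inspired filterbank and additional post-processing and dimensionality reduction), the per-band temporal envelopes obtained from a short-time spectral analysis can be analyzed with a second spectral transform over a longer temporal context; the function name, frame sizes, and context length below are illustrative assumptions.

```python
import numpy as np
from scipy.signal import stft

def modulation_spectrum_features(x, fs, context_frames=16):
    """Simplified modulation-spectrum representation: STFT magnitudes give
    per-band temporal envelopes, and a second FFT across a longer context
    of frames yields the energy at each modulation frequency for every
    acoustic-frequency band."""
    _, _, Z = stft(x, fs=fs, nperseg=int(0.032 * fs), noverlap=int(0.016 * fs))
    env = np.abs(Z)                                   # bands x frames
    feats = []
    for start in range(0, env.shape[1] - context_frames + 1, context_frames):
        block = env[:, start:start + context_frames]  # one temporal context
        mod = np.abs(np.fft.rfft(block, axis=1))      # modulation spectrum per band
        feats.append(mod.flatten())                   # one feature vector per context
    return np.array(feats)
```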

Conclusions

This paper has addressed the issue of speaker verification (SV) based on whispered speech. Two different approaches were proposed in order to reduce error rates for SV with whispered speech while maintaining performance with normal speech. First, three innovative features were proposed: AAMF, RMFCC and LMFCC, each taking into account complementary characteristics of whispered and normal-voiced speech signals. Second, a score fusion scheme based on systems trained on the three feature sets
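For completeness, a minimal sketch of score-level fusion of the per-feature subsystems is given below; the equal-weight default is an illustrative assumption, and in practice the fusion weights and offset are typically trained by logistic regression on a development set (e.g., with the BOSARIS toolkit cited in the references).

```python
import numpy as np

def fuse_scores(score_lists, weights=None):
    """Weighted linear score-level fusion of several subsystems
    (e.g., systems built on the AAMF, RMFCC and LMFCC feature sets)."""
    scores = np.vstack(score_lists)                   # subsystems x trials
    if weights is None:
        weights = np.full(scores.shape[0], 1.0 / scores.shape[0])
    return weights @ scores                           # one fused score per trial
```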

Acknowledgments

The authors acknowledge funding from the Natural Sciences and Engineering Research Council of Canada (NSERC) and the Administrative Department of Science, Technology and Innovation of Colombia (COLCIENCIAS).

References (52)

  • P. Zelinka et al.

    Impact of vocal effort variability on automatic speech recognition

    Speech Commun.

    (2012)
  • A. Anjos et al.

    Bob: a free signal processing and machine learning toolbox for researchers

    Proceedings of the 20th ACM Conference on Multimedia Systems (ACMMM), Nara, Japan

    (2012)
  • L. Besacier et al.

    Subband approach for automatic speaker recognition: optimal division of the frequency domain

    Proceedings of the First International Conference on Audio- and Video-based Biometric Person Authentication

    (1997)
  • N. Brummer et al.

    The BOSARIS Toolkit User Guide: Theory, Algorithms and Code for Binary Classifier Score Processing

    Technical report

    (2011)
  • G. Chenghui et al.

A preliminary study on emotions of Chinese whispered speech

    Proceedings of the International Forum on Computer Science-Technology and Applications

    (2009)
  • A. Clerico et al.

Mutual information between inter-hemispheric EEG spectro-temporal patterns: a new feature for automated affect recognition

Proceedings of IEEE/EMBS NER

    (2015)
  • N. Dehak et al.

    Front-end factor analysis for speaker verification

IEEE Trans. Audio Speech Lang. Process.

    (2011)
  • T. Drugman et al.

    On the potential of glottal signatures for speaker recognition

    Proceedings of INTERSPEECH

    (2010)
  • P. Estevez et al.

    Normalized mutual information feature selection

    IEEE Trans. Neural Netw.

    (2009)
  • T. Falk et al.

    Modulation spectral features for robust far-field speaker identification

    IEEE Trans. Audio Speech Lang. Process.

    (2010)
X. Fan et al.

    Speaker identification for whispered speech based on frequency warping and score competition

    Proceedings of INTERSPEECH

    (2008)
X. Fan et al.

    Speaker identification with whispered speech based on modified LFCC parameters and feature mapping

    Proceedings of International Conference on Acoustics, Speech and Signal Processing, ICASSP

    (2009)
  • X. Fan et al.

    Acoustic analysis for speaker identification of whispered speech

    Proceedings of International Conference on Acoustics, Speech and Signal Processing, ICASSP

    (2010)
  • X. Fan et al.

    Speaker identification within whispered speech audio streams

    IEEE Trans. Audio Speech Lang. Process.

    (2011)
  • L. Gallardo et al.

Advantages of wideband over narrowband channels for speaker verification employing MFCCs and LFCCs

    Proceedings of INTERSPEECH

    (2014)
  • Garofolo, J. S., Linguistic Data Consortium, et al., 1993. TIMIT: Acoustic-Phonetic Continuous Speech...
This paper has been recommended for acceptance by Roger Moore.