Fusion of auditory inspired amplitude modulation spectrum and cepstral features for whispered and normal speech speaker verification☆
Introduction
According to recent statistics, speech-based biometrics rank highly in customer preference, outranking fingerprint and iris scanning solutions (O'Neil King, 2014; Markets, 2015). Due to the widespread usage of smartphones worldwide, speech-based biometrics are quickly gaining popularity, particularly in financial institutions (O'Neil King, 2014). Within such applications, customers can access their secure banking and insurance services by simply speaking into their phones. For financial institutions, this ease of use enhances customer satisfaction while reducing customer care costs through increased automation rates. Customers, on the other hand, given the flexibility of speech-based communication, can pose serious challenges to such applications by changing, for example, their vocal effort depending on the environment or context they are in. This, together with ambient noise, poses a serious threat to the performance of speech-enabled applications in general. Ambient noise has detrimental effects on speech-based biometric systems, particularly those trained with mel-frequency cepstral coefficients (MFCC). As an example, speaker identification accuracy as low as 7% has been reported in very noisy environments (Ming et al., 2007). As such, over the years, several speech enhancement algorithms have been proposed for environment-robust speaker recognition (Rao and Sarkar, 2014). Varying vocal effort, however, has received significantly less attention, despite its severe detrimental effect on speaker verification performance. For example, whispered-speech speaker identification accuracy as low as 20% has been reported in clean conditions (Grimaldi and Cummins, 2008). In fact, customers using a mobile banking application on their smartphones are highly likely to lower their vocal effort when providing sensitive information; for the purposes of this research, we are therefore interested in whispered speech.
Here, special emphasis is placed on whispered speech because this speaking style has recently gained considerable attention for security applications. With its reduced perceptibility, whispered speech is a natural mode of speech production that conveys relevant and useful information for many applications. Like normal-voiced speech, whispered speech not only conveys a message, but also traits such as identity, gender, and emotional and health states, to name a few (Lass et al., 1976; Tartter, 1991; Ito et al., 2005; Chenghui et al., 2009; Tsunoda et al., 2012). As previously mentioned, whispered speech is commonly used in public situations where private or discreet information needs to be exchanged, for example, when providing a credit card number, bank account number, or other personal information. Despite the amount of information present in whispered speech, certain characteristics make this speaking style challenging as an input to speech-enabled applications. The most salient characteristic of whispered speech is the lack of vocal fold vibration. Furthermore, when a person whispers, several changes occur in the vocal tract configuration, altering not only the excitation source, but also the syllabic rate and the general temporal dynamics of the generated speech signal (Jovicic and Saric, 2008; Ito et al., 2005). Hence, classical methods designed for normal-voiced speech characterization are expected to fail when tested in atypical scenarios that include whispered speech (Grimaldi and Cummins, 2008; Ito et al., 2005; Fan and Hansen, 2011; Zelinka et al., 2012).
Despite the limited research in this field, different approaches have been reported to overcome some of these disadvantages, particularly under train/test mismatch conditions where speaker models were trained with normal speech and tested with whispered speech (Fan and Hansen, 2011; Grimaldi and Cummins, 2008; Fan and Hansen, 2013). In previous work, we found that, among the different feature sets and strategies evaluated, including frequency warping and alternate feature representations such as MHEC (mean Hilbert envelope coefficients) and WIF (weighted instantaneous frequencies), the invariant information shared between normal-voiced and whispered speech is not sufficient to achieve reliable speaker verification performance for both speaking styles (Sarria-Paja and Falk, 2015). We also observed that the strategies that performed best for normal-voiced speech did not exhibit the same benefits for whispered speech. Finally, it was shown that data from both speaking styles must be included during training or enrollment for a speaker verification system to handle both normal and whispered speech in practical applications (Sarria-Paja and Falk, 2015). It was concluded that efforts should be directed towards new feature representations that reduce the impact of adding whispered speech during training or enrollment and that extract speaker-specific information from whispered recordings more efficiently.
This paper proposes just that: here, we compute features aimed at extracting invariant information embedded within both speaking styles. This is achieved with modulation spectrum based features, which in the past have been shown to accurately separate speech from environment-based components (e.g., noise and reverberation) (Falk and Chan, 2010), thus adding robustness to speaker recognition systems. We also use mutual information (MI) as an analysis measure to identify invariant information between normal-voiced and whispered speech feature pairs. MI captures both linear and non-linear statistical dependencies between two feature sets and has been shown to be an effective way to measure relevance and redundancy among features for feature selection or characterization purposes (Peng et al., 2005; Estevez et al., 2009; Clerico et al., 2015). This, combined with system fusion, helps not only to reduce error rates when no whispered recordings from target speakers are available for enrollment, but also to mitigate the observed negative impact of adding whispered speech during parameter estimation and enrollment. In speech recognition, for example, gains in whispered speech accuracy were countered by losses in normal speech accuracy, often by the same amount (Zelinka et al., 2012). Such tradeoffs were attributed to excessive generality of the speech model (caused by large variations in the training set) and a consequently reduced capability of discriminating among speech units (Zelinka et al., 2012). The proposed system overcomes this limitation for the speaker verification task.
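To make the analysis step concrete, the sketch below estimates MI between two scalar feature streams with a simple histogram-based estimator, in the spirit of the entropy/MI estimation literature cited below. It is a minimal illustration under assumed bin counts, with synthetic data standing in for time-aligned normal/whispered feature pairs; it is not the authors' implementation:

```python
import numpy as np

def mutual_information(x, y, bins=32):
    """Histogram-based estimate of I(X;Y) in bits between two
    equal-length scalar feature streams."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()                 # joint probability mass
    px = pxy.sum(axis=1, keepdims=True)       # marginal of x
    py = pxy.sum(axis=0, keepdims=True)       # marginal of y
    nz = pxy > 0                              # skip empty cells: 0*log(0) := 0
    return float(np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])))

# Synthetic stand-in for a (normal-voiced, whispered) feature pair:
# partly dependent streams should yield MI well above zero.
rng = np.random.default_rng(0)
c_normal = rng.standard_normal(5000)
c_whisper = 0.6 * c_normal + 0.8 * rng.standard_normal(5000)
print(f"I(normal; whisper) ~ {mutual_information(c_normal, c_whisper):.2f} bits")
```

Feature pairs exhibiting high MI across speaking styles are the ones expected to carry style-invariant speaker information.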
The remainder of this paper is organized as follows. Section 2 provides a brief background on whispered speech, emphasizing its main differences from normal speech and reviewing related approaches found in the literature. Section 3 describes the speaker verification problem, the corpus employed, and the baseline system characterization. Section 4 presents the different approaches and strategies explored to reduce the error rate in whispered speech speaker verification, and discusses the experimental results and the performance achieved by the proposed schemes. Lastly, Section 5 presents the conclusions.
Whispered speech
In the past, perceptual studies have been conducted to characterize the major acoustic differences between whispered and normal-voiced speech, the lack of fundamental frequency being the most relevant. It is not the only major change, however: shifts of the formants towards higher frequencies (Thomas, 1969; Higashikawa et al., 1996), especially the lower formants (Sharifzadeh et al., 2012), have also been reported. Whispered speech also has a lower and flatter power spectral density.
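As a rough illustration of how this "lower and flatter" spectral balance can be quantified, the sketch below fits a line to the long-term average power spectrum and reports its slope in dB/kHz (whispered speech would be expected to yield a less negative slope than voiced speech). The frame length, windowing, and slope measure are illustrative choices, not taken from the studies cited above:

```python
import numpy as np

def spectral_tilt(signal, fs, n_fft=512):
    """Slope (dB per kHz) of a line fit to the long-term average
    power spectrum; flatter spectra give slopes closer to zero."""
    n_frames = len(signal) // n_fft
    frames = signal[:n_frames * n_fft].reshape(n_frames, n_fft) * np.hanning(n_fft)
    psd = np.mean(np.abs(np.fft.rfft(frames, axis=1)) ** 2, axis=0)
    freqs_khz = np.fft.rfftfreq(n_fft, d=1.0 / fs) / 1000.0
    keep = freqs_khz > 0.05                   # ignore the DC region
    slope, _ = np.polyfit(freqs_khz[keep], 10 * np.log10(psd[keep] + 1e-12), 1)
    return slope

# White noise has a flat spectrum, so the fitted slope is near 0 dB/kHz.
rng = np.random.default_rng(0)
print(f"{spectral_tilt(rng.standard_normal(16000), fs=16000):.2f} dB/kHz")
```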
Automatic speaker verification (SV) system characterization
Traditionally, speaker recognition systems have been based on identity vector (i-vector) extraction (Dehak et al., 2011), and matching between a test utterance and a target speaker is done using either a fast scoring method based on the cosine distance between i-vectors or probabilistic linear discriminant analysis (PLDA)-based scoring (Sizov et al., 2014). More recently, deep neural network (DNN) approaches have been shown to be useful either during feature extraction or for computing statistics.
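The fast scoring method mentioned above reduces to a normalized inner product between the test and target i-vectors. A minimal sketch follows; the 400-dimensional vectors and the decision threshold are placeholder values, since typical dimensionalities and operating points depend on the system and development data:

```python
import numpy as np

def cosine_score(w_test, w_target):
    """Cosine similarity between a test i-vector and a target-speaker
    i-vector; higher scores indicate a more likely target trial."""
    return float(np.dot(w_test, w_target) /
                 (np.linalg.norm(w_test) * np.linalg.norm(w_target)))

# Accept the identity claim if the score exceeds a threshold tuned on a
# development set (e.g., at the equal error rate operating point).
rng = np.random.default_rng(0)
w_test, w_target = rng.standard_normal(400), rng.standard_normal(400)
accept = cosine_score(w_test, w_target) > 0.5  # placeholder threshold
```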
Proposed strategies to improve system performance when testing with whispered speech
Results presented in the previous section show that standard SV systems have serious deficiencies when facing atypical scenarios. As such, for the task at hand, it is necessary to devise strategies that compensate for the negative effects observed when whispered speech is included among the possible testing scenarios. The approach taken in this work is to improve the feature representation used for i-vector extraction. From the baseline experiments it is clear that the total variability model alone cannot compensate for the mismatch introduced by whispered speech.
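For context, the feature representations explored here build on the modulation spectrum: a conventional acoustic spectrogram followed by a second spectral analysis along time, yielding energy per (acoustic frequency, modulation frequency) pair (Falk and Chan, 2010). The sketch below illustrates that general recipe; the frame, hop, and context sizes are assumed values for illustration, not the exact settings of the proposed features:

```python
import numpy as np

def modulation_spectrum(signal, fs, frame_len=0.032, frame_shift=0.004,
                        context=0.200):
    """Two-stage transform: STFT magnitude (acoustic frequency), then an
    FFT along time within fixed-length contexts (modulation frequency)."""
    n_fft, hop = int(frame_len * fs), int(frame_shift * fs)
    win = np.hanning(n_fft)
    # Stage 1: acoustic spectrogram (frames x acoustic-frequency bins).
    n_frames = 1 + (len(signal) - n_fft) // hop
    spec = np.abs(np.array([np.fft.rfft(win * signal[i * hop:i * hop + n_fft])
                            for i in range(n_frames)]))
    # Stage 2: FFT over time per acoustic band, within each context block.
    ctx = int(context / frame_shift)          # frames per context window
    n_ctx = spec.shape[0] // ctx
    blocks = spec[:n_ctx * ctx].reshape(n_ctx, ctx, -1)
    return np.abs(np.fft.rfft(blocks, axis=1))  # (contexts, mod-freq, ac-freq)

# One second of noise at 16 kHz -> a stack of modulation-spectral frames.
rng = np.random.default_rng(0)
print(modulation_spectrum(rng.standard_normal(16000), fs=16000).shape)
```

Because speech energy is modulated slowly (roughly at syllabic rates) while environmental components typically are not, such representations tend to isolate speaker-relevant structure from noise and reverberation.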
Conclusions
This paper has addressed the issue of speaker verification (SV) based on whispered speech. Two different approaches were proposed to reduce error rates for SV with whispered speech while maintaining performance with normal speech. First, three innovative features were proposed: AAMF, RMFCC and LMFCC, each taking into account complementary characteristics of whispered and normal-voiced speech signals. Second, a score fusion scheme based on systems trained on the three feature sets was proposed.
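For completeness, score-level fusion of several subsystems is commonly realized with linear logistic regression, as implemented, e.g., in the BOSARIS toolkit referenced below. The following sketch uses scikit-learn as a stand-in, with synthetic scores; it illustrates the general scheme, not the authors' exact fusion setup:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical development-set scores: one column per subsystem (e.g.,
# AAMF-, RMFCC- and LMFCC-based systems), one row per verification trial.
rng = np.random.default_rng(0)
dev_scores = rng.standard_normal((1000, 3))
dev_labels = rng.integers(0, 2, size=1000)    # 1 = target trial

# Learn one weight per subsystem plus an offset on development trials,
# then apply the same linear combination to evaluation scores.
fuser = LogisticRegression()
fuser.fit(dev_scores, dev_labels)
eval_scores = rng.standard_normal((200, 3))
fused = fuser.decision_function(eval_scores)  # fused log-odds-like scores
```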
Acknowledgments
The authors acknowledge funding from the Natural Sciences and Engineering Research Council of Canada (NSERC) and the Administrative Department of Science, Technology and Innovation of Colombia (COLCIENCIAS).
References
- et al., Investigation on LP-residual representations for speaker identification, Pattern Recognit. (2009)
- et al., The NIST speaker recognition evaluation – overview, methodology, systems, results, perspective, Speech Commun. (2000)
- et al., Characterization of atypical vocal source excitation, temporal dynamics and prosody for objective measurement of dysarthric word intelligibility, Speech Commun. (2012)
- et al., Acoustic analysis and feature transformation from neutral to whisper for speaker identification within whispered speech audio streams, Speech Commun. (2013)
- et al., Perceived pitch of whispered vowels – relationship with formant frequencies: a preliminary study, J. Voice (1996)
- et al., Analysis and recognition of whispered speech, Speech Commun. (2005)
- et al., Acoustic analysis of consonants in whispered speech, J. Voice (2008)
- et al., An overview of text-independent speaker recognition: from features to supervectors, Speech Commun. (2010)
- On estimation of entropy and mutual information of continuous distributions, Signal Process. (1989)
- et al., Brain activity in aphonia after a coughing episode: different brain activity in healthy whispering and pathological aphonic conditions, J. Voice (2012)
- Impact of vocal effort variability on automatic speech recognition, Speech Commun.
- Bob: a free signal processing and machine learning toolbox for researchers, Proceedings of the 20th ACM Conference on Multimedia Systems (ACMMM), Nara, Japan
- Subband approach for automatic speaker recognition: optimal division of the frequency domain, Proceedings of the First International Conference on Audio- and Video-based Biometric Person Authentication
- The BOSARIS Toolkit User Guide: Theory, Algorithms and Code for Binary Classifier Score Processing, Technical report
- A preliminary study on emotions of Chinese whispered speech, Proceedings of the International Forum on Computer Science-Technology and Applications
- Mutual information between inter-hemispheric EEG spectro-temporal patterns: a new feature for automated affect recognition, Proceedings of IEEE/EMBS NER
- Front-end factor analysis for speaker verification, IEEE Trans. Audio Speech Lang. Process.
- On the potential of glottal signatures for speaker recognition, Proceedings of INTERSPEECH
- Normalized mutual information feature selection, IEEE Trans. Neural Netw.
- Modulation spectral features for robust far-field speaker identification, IEEE Trans. Audio Speech Lang. Process.
- Speaker identification for whispered speech based on frequency warping and score competition, Proceedings of INTERSPEECH
- Speaker identification with whispered speech based on modified LFCC parameters and feature mapping, Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- Acoustic analysis for speaker identification of whispered speech, Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- Speaker identification within whispered speech audio streams, IEEE Trans. Audio Speech Lang. Process.
- Advantages of wideband over narrowband channels for speaker verification employing MFCCs and LFCCs, Proceedings of INTERSPEECH
☆ This paper has been recommended for acceptance by Roger Moore.